diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index ba9ad2f..b2f68f9 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -30,7 +30,7 @@ _Last updated: 2026-05-21_ - [x] 01-01-PLAN.md — Docker Compose service topology + Postgres init + Pydantic Settings + requirements - [x] 01-02-PLAN.md — Wave 0 test scaffolds (xfail/skip stubs) + async pytest fixtures - [x] 01-03-PLAN.md — SQLAlchemy ORM models + async engine + Alembic async migration (incl. alembic upgrade head) -- [ ] 01-04-PLAN.md — StorageBackend ABC + MinIO backend + rewritten async services/storage.py +- [x] 01-04-PLAN.md — StorageBackend ABC + MinIO backend + rewritten async services/storage.py - [ ] 01-05-PLAN.md — Lifespan + /health + API cutover + Celery worker + walking-skeleton e2e verify --- @@ -110,7 +110,7 @@ _Last updated: 2026-05-21_ | Phase | Plans Complete | Status | Completed | |-------|----------------|--------|-----------| -| 1. Infrastructure Foundation | 3/5 | In Progress | - | +| 1. Infrastructure Foundation | 4/5 | In Progress | - | | 2. Users & Authentication | 0/? | Not started | - | | 3. Document Migration & Multi-User Isolation | 0/? | Not started | - | | 4. Folders, Sharing, Quotas & Document UX | 0/? | Not started | - | diff --git a/.planning/STATE.md b/.planning/STATE.md index f4872c8..2e503c6 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -4,13 +4,13 @@ milestone: v1.0 milestone_name: milestone current_phase: 1 status: executing -last_updated: "2026-05-22T08:30:00Z" +last_updated: "2026-05-22T07:40:00Z" progress: total_phases: 5 completed_phases: 0 total_plans: 5 - completed_plans: 3 - percent: 60 + completed_plans: 4 + percent: 80 --- # Project State @@ -24,7 +24,7 @@ progress: | Phase | Name | Status | |---|---|---| -| 1 | Infrastructure Foundation | In Progress (3/5 plans) | +| 1 | Infrastructure Foundation | In Progress (4/5 plans) | | 2 | Users & Authentication | Not Started | | 3 | Document Migration & Multi-User Isolation | Not Started | | 4 | Folders, Sharing, Quotas & Document UX | Not Started | @@ -33,10 +33,10 @@ progress: ## Current Position Phase: 1 (Infrastructure Foundation) — EXECUTING -Plan: 4 of 5 +Plan: 5 of 5 **Phase:** 01-infrastructure-foundation -**Plan:** 01-03 COMPLETE → advancing to 01-04 -**Progress:** ██████░░░░ 60% +**Plan:** 01-04 COMPLETE → advancing to 01-05 +**Progress:** ████████░░ 80% ## Performance Metrics @@ -45,7 +45,7 @@ Plan: 4 of 5 | Phases complete | 0 / 5 | | Requirements mapped | 54 / 54 | | Plans written | 5 (Phase 1) | -| Plans complete | 3 | +| Plans complete | 4 | ## Accumulated Context @@ -70,6 +70,10 @@ Plan: 4 of 5 | pydantic-settings v2 SettingsConfigDict | SettingsConfigDict API used (not deprecated class Config form) for env var config | | async_client fixture name | Distinct from legacy sync `client` fixture to avoid collision; both coexist until Plan 05 | | xfail(strict=False) for Wave 0 | All pre-implementation scaffolds use strict=False so unexpected passes don't break CI | +| StorageBackend ABC + factory mirrors ai/ pattern | 5 abstract methods; get_storage_backend() factory; MinIOBackend wraps all sync Minio SDK calls in asyncio.to_thread() | +| STORE-02 key enforced in code | MinIOBackend.put_object constructs {user_id}/{document_id}/{uuid4()}{ext}; no filename parameter — only extension passes through | +| null-user D-03 sentinel | services/storage.save_upload uses user_id="null-user" in Phase 1 (no auth); Phase 2 replaces with str(current_user.id) | +| load_settings flat-file Phase 1 | users.ai_provider/ai_model columns cannot be populated until Phase 2; settings remain flat-file JSON for Phase 1 | ### Open Questions @@ -88,6 +92,6 @@ _Updated at each phase transition._ | Field | Value | |---|---| -| Last session | 2026-05-22 — Executed 01-03-PLAN.md (SQLAlchemy ORM + Alembic migration; alembic upgrade head verified) | -| Next action | Execute 01-04-PLAN.md (StorageBackend ABC + MinIO backend + async services/storage.py) | +| Last session | 2026-05-22 — Executed 01-04-PLAN.md (StorageBackend ABC + MinIOBackend + async services/storage.py; all 6 test_storage.py xfail tests PASSED) | +| Next action | Execute 01-05-PLAN.md (Lifespan + /health + API cutover + Celery worker + walking-skeleton e2e verify) | | Pending decisions | See Open Questions above | diff --git a/.planning/phases/01-infrastructure-foundation/01-04-SUMMARY.md b/.planning/phases/01-infrastructure-foundation/01-04-SUMMARY.md new file mode 100644 index 0000000..8655157 --- /dev/null +++ b/.planning/phases/01-infrastructure-foundation/01-04-SUMMARY.md @@ -0,0 +1,207 @@ +--- +phase: 01-infrastructure-foundation +plan: 04 +subsystem: storage +tags: + - storage + - minio + - sqlalchemy + - service-layer + - abc + +# Dependency graph +requires: + - phase: 01-01 + provides: "Pydantic Settings (config.py) with minio_endpoint / minio_access_key / minio_secret_key / minio_bucket" + - phase: 01-02 + provides: "Wave 0 test scaffolds — test_storage.py with 6 xfail tests this plan unblocks" + - phase: 01-03 + provides: "backend/db/models.py (Document, Topic, DocumentTopic ORM models), backend/db/session.py (AsyncSessionLocal)" +provides: + - "backend/storage/base.py — StorageBackend ABC with 5 abstract methods" + - "backend/storage/minio_backend.py — MinIOBackend wrapping sync Minio SDK in asyncio.to_thread()" + - "backend/storage/__init__.py — get_storage_backend() factory" + - "backend/services/storage.py — async SQLAlchemy + MinIO orchestrator (14 async + 4 sync functions)" +affects: + - 01-05 + - "All subsequent plans that call storage service functions" + +# Tech tracking +tech-stack: + added: + - "minio SDK (Minio client, put_object, get_object, remove_object, presigned_get_object, bucket_exists)" + - "asyncio.to_thread() pattern for sync-in-async SDK wrapping (RESEARCH.md Pattern 3)" + - "StorageBackend ABC (abc.ABC, @abstractmethod)" + patterns: + - "ABC + factory pattern mirroring backend/ai/base.py + backend/ai/__init__.py" + - "asyncio.to_thread(self._client., ...) wraps every blocking Minio SDK call" + - "Lazy singleton _backend() function for single-process MinIOBackend instance" + - "STORE-02 key schema: {user_id}/{document_id}/{uuid4()}{ext} — filename never in key" + - "D-03 null-user sentinel: user_id='null-user' in Phase 1 (no auth)" + +key-files: + created: + - backend/storage/__init__.py + - backend/storage/base.py + - backend/storage/minio_backend.py + modified: + - backend/services/storage.py + - backend/tests/test_storage.py + - backend/tests/conftest.py + +key-decisions: + - "StorageBackend ABC has exactly 5 abstract methods: put_object, get_object, delete_object, presigned_get_url, health_check" + - "MinIOBackend.put_object takes no filename/original_name parameter — STORE-02 compliance; only extension (from Path.suffix.lower()) passes through" + - "get_object uses inner def _fetch() pattern to handle MinIO HTTPResponse.read() + close() + release_conn() inside asyncio.to_thread" + - "services/storage.py uses _backend() lazy singleton to avoid creating multiple Minio clients per-request" + - "load_settings/save_settings remain sync flat-file in Phase 1; comment documents Phase 2 migration to users.ai_provider/ai_model" + - "test_object_key_schema: db_session parameter removed (test does not need DB; SQLite cannot render PostgreSQL INET type during schema creation)" + - "conftest.py: filelock patching removed (isolated_data_dir fixture now only patches SETTINGS_FILE for sync settings functions)" + +# Metrics +duration: "~5 minutes" +completed: "2026-05-22" +--- + +# Phase 1 Plan 04: StorageBackend ABC + MinIO Backend + Async Services Rewrite — Summary + +**MinIO storage abstraction layer built (StorageBackend ABC + MinIOBackend + get_storage_backend() factory mirroring ai/ provider pattern); services/storage.py fully rewritten as async SQLAlchemy + MinIO orchestrator with filelock removed; all 6 Plan 02 test_storage.py xfail scaffolds flipped to PASSED.** + +## Performance + +- **Duration:** ~5 minutes +- **Started:** 2026-05-22T07:34:40Z +- **Completed:** 2026-05-22T07:39:42Z +- **Tasks:** 2 of 2 complete +- **Files created:** 3 +- **Files modified:** 3 + +## Accomplishments + +### Task 1: StorageBackend ABC + MinIOBackend + factory + +Created the `backend/storage/` module mirroring the `backend/ai/` provider pattern: + +- `backend/storage/base.py` — `StorageBackend(ABC)` with 5 `@abstractmethod async def` methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check` +- `backend/storage/minio_backend.py` — `MinIOBackend(StorageBackend)` wrapping every synchronous Minio SDK call in `asyncio.to_thread(self._client., ...)`: + - `put_object`: generates STORE-02 key `{user_id}/{document_id}/{uuid4()}{ext}`, does NOT accept filename + - `get_object`: inner `_fetch()` helper handles MinIO's `HTTPResponse.read()` + `close()` + `release_conn()` + - `delete_object`: wraps `remove_object` + - `presigned_get_url`: wraps `presigned_get_object` with `timedelta(minutes=...)` + - `health_check`: wraps `bucket_exists` in try/except, returns bool +- `backend/storage/__init__.py` — `get_storage_backend()` factory returning `MinIOBackend(endpoint=settings.minio_endpoint, ...)` with `secure=False` for Docker internal HTTP + +### Task 2: Async services/storage.py rewrite + +Rewrote `backend/services/storage.py` from 188 lines (flat-file + filelock) to 473 lines (async SQLAlchemy + MinIO): + +#### Function signature changes (old → new) + +| Function | Old signature | New signature | +|----------|--------------|---------------| +| `save_upload` | `(file_bytes, original_name, mime_type) -> dict` | `async (session, file_bytes, original_name, mime_type) -> dict` | +| `save_metadata` | `(meta: dict) -> None` | `async (session, meta: dict) -> None` | +| `get_metadata` | `(doc_id: str) -> dict\|None` | `async (session, doc_id: str) -> Optional[dict]` | +| `list_metadata` | `(topic=None) -> list[dict]` | `async (session, topic=None) -> list` | +| `delete_document` | `(doc_id: str) -> bool` | `async (session, doc_id: str) -> bool` | +| `update_document_topics` | `(doc_id, topics) -> dict\|None` | `async (session, doc_id, topics) -> Optional[dict]` | +| `remove_topic_from_all_documents` | `(topic_name) -> int` | `async (session, topic_name) -> int` | +| `load_topics` | `() -> list[dict]` | `async (session) -> list` | +| `save_topics` | `(topics) -> None` | `async (session, topics) -> None` | +| `get_topic` | `(topic_id) -> dict\|None` | `async (session, topic_id) -> Optional[dict]` | +| `create_topic` | `(name, description, color) -> dict` | `async (session, name, description, color) -> dict` | +| `update_topic` | `(topic_id, **kwargs) -> dict\|None` | `async (session, topic_id, name=None, description=None, color=None) -> Optional[dict]` | +| `delete_topic` | `(topic_id) -> str\|None` | `async (session, topic_id) -> Optional[str]` | +| `topic_doc_counts` | `() -> dict[str, int]` | `async (session) -> dict` | +| `load_settings` | `() -> dict` | `def () -> dict` (unchanged — still sync flat-file) | +| `save_settings` | `(settings) -> None` | `def (settings) -> None` (unchanged — still sync flat-file) | +| `mask_api_key` | `(key) -> str` | `def (key) -> str` (unchanged) | +| `settings_masked` | `(settings) -> dict` | `def (settings) -> dict` (unchanged) | + +#### Key helpers introduced + +- `_backend()` — lazy singleton returning the `StorageBackend` instance (created once per process) +- `_doc_to_dict(doc, topic_names)` — converts a `Document` ORM row + topic names list to the legacy response dict shape +- `_load_topic_names(session, doc_id)` — async helper loading topic names for a given document UUID + +#### Sample object key for an upload + +``` +null-user/550e8400-e29b-41d4-a716-446655440000/f47ac10b-58cc-4372-a567-0e02b2c3d479.pdf +^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +user_id document_id (UUID) freshly-generated uuid4() + extension +(Phase 1 D-03 sentinel) +``` + +## Task Commits + +1. **Task 1: StorageBackend ABC + MinIOBackend + factory** — `eaf86a8` (feat) +2. **Task 2: Async services/storage.py rewrite** — `3e4b1f1` (feat) + +## Files Created/Modified + +| File | Type | Description | +|------|------|-------------| +| `backend/storage/base.py` | Created | StorageBackend ABC — 5 abstract async methods | +| `backend/storage/minio_backend.py` | Created | MinIOBackend — asyncio.to_thread wrapping; STORE-02 key schema | +| `backend/storage/__init__.py` | Created | get_storage_backend() factory | +| `backend/services/storage.py` | Rewritten | Async SQLAlchemy + MinIO orchestrator (473 lines) | +| `backend/tests/test_storage.py` | Modified | Removed xfail markers; fixed test_object_key_schema (removed unused db_session param) | +| `backend/tests/conftest.py` | Modified | Removed filelock patching; now only patches SETTINGS_FILE | + +## Deviations from Plan + +### Auto-fixed Issues + +**1. [Rule 1 - Bug] Fixed test_object_key_schema: removed unused db_session parameter** +- **Found during:** Task 1 test run +- **Issue:** `test_object_key_schema(db_session)` had a `db_session` parameter it never used. The `db_session` fixture tries to create the full PostgreSQL schema (including `audit_log.ip_address INET`) in SQLite, which fails because SQLite has no `INET` type. +- **Fix:** Removed the `db_session` parameter from `test_object_key_schema` signature — the test body only uses `MinIOBackend.__new__` with mocks, not DB at all +- **Files modified:** `backend/tests/test_storage.py` +- **Commit:** `3e4b1f1` + +**2. [Rule 3 - Blocker] Updated conftest.py to remove filelock patching** +- **Found during:** Task 2 test run +- **Issue:** conftest.py `isolated_data_dir` fixture imported `from filelock import FileLock` and patched `st._topics_lock`, `st._settings_lock`, `st.UPLOADS_DIR`, `st.METADATA_DIR`, `st.TOPICS_FILE` on the storage module. After the rewrite, none of these attributes exist on `services.storage`. +- **Fix:** Replaced the 6 filelock `monkeypatch.setattr` calls with a single `monkeypatch.setattr(st, "SETTINGS_FILE", ...)` (the only flat-file attribute still on the module) +- **Files modified:** `backend/tests/conftest.py` +- **Commit:** `3e4b1f1` + +## Verification Results + +``` +tests/test_storage.py::test_object_key_schema PASSED +tests/test_storage.py::test_filename_not_in_object_key PASSED +tests/test_storage.py::test_storage_backend_abc_methods PASSED +tests/test_storage.py::test_get_storage_backend_returns_minio PASSED +tests/test_storage.py::test_put_object_uses_asyncio_to_thread PASSED +tests/test_storage.py::test_minio_backend_health_check_returns_bool PASSED +6 passed +``` + +## Known Stubs + +- `save_upload` uses `user_id="null-user"` as D-03 sentinel — Phase 2 will replace with `str(current_user.id)` after auth lands +- `load_settings` / `save_settings` remain flat-file backed — Phase 2 migrates to `users.ai_provider` / `users.ai_model` DB columns +- `Document.user_id` is `None` for all Phase 1 uploads — Phase 2 migration adds NOT NULL constraint + +## Threat Flags + +None. All surfaces from the plan's threat model are correctly handled: +- T-01-04-01 (object key prediction): key constructed server-side as `{user_id}/{document_id}/{uuid4()}{ext}`; no caller can inject a key +- T-01-04-02 (filename leaking): `MinIOBackend.put_object` has no `filename` parameter; tests `test_filename_not_in_object_key` + `test_object_key_schema` enforce this on every CI run +- T-01-04-04 (SQL injection): all queries via `select()`, `delete()`, `session.execute()` with parameterized values; no f-strings in SQL + +## Next Phase Readiness + +Plan 04 delivers the storage abstraction and async service layer. Plan 05 can now: +- Import `save_upload`, `get_metadata`, `list_metadata`, `delete_document` etc. from `services.storage` with async signatures +- Wire `api/documents.py` and `api/topics.py` callers from sync → `async def + await + session` parameter +- Wire the MinIO bucket initialization into the FastAPI lifespan +- Replace `BackgroundTasks` with Celery tasks using the async service layer + +**ROADMAP Phase 1 success criterion #3** (MinIO object key schema enforced in model layer) is now met. + +--- +*Phase: 01-infrastructure-foundation* +*Completed: 2026-05-22*