--- phase: 01-infrastructure-foundation plan: 04 subsystem: storage tags: - storage - minio - sqlalchemy - service-layer - abc # Dependency graph requires: - phase: 01-01 provides: "Pydantic Settings (config.py) with minio_endpoint / minio_access_key / minio_secret_key / minio_bucket" - phase: 01-02 provides: "Wave 0 test scaffolds — test_storage.py with 6 xfail tests this plan unblocks" - phase: 01-03 provides: "backend/db/models.py (Document, Topic, DocumentTopic ORM models), backend/db/session.py (AsyncSessionLocal)" provides: - "backend/storage/base.py — StorageBackend ABC with 5 abstract methods" - "backend/storage/minio_backend.py — MinIOBackend wrapping sync Minio SDK in asyncio.to_thread()" - "backend/storage/__init__.py — get_storage_backend() factory" - "backend/services/storage.py — async SQLAlchemy + MinIO orchestrator (14 async + 4 sync functions)" affects: - 01-05 - "All subsequent plans that call storage service functions" # Tech tracking tech-stack: added: - "minio SDK (Minio client, put_object, get_object, remove_object, presigned_get_object, bucket_exists)" - "asyncio.to_thread() pattern for sync-in-async SDK wrapping (RESEARCH.md Pattern 3)" - "StorageBackend ABC (abc.ABC, @abstractmethod)" patterns: - "ABC + factory pattern mirroring backend/ai/base.py + backend/ai/__init__.py" - "asyncio.to_thread(self._client., ...) wraps every blocking Minio SDK call" - "Lazy singleton _backend() function for single-process MinIOBackend instance" - "STORE-02 key schema: {user_id}/{document_id}/{uuid4()}{ext} — filename never in key" - "D-03 null-user sentinel: user_id='null-user' in Phase 1 (no auth)" key-files: created: - backend/storage/__init__.py - backend/storage/base.py - backend/storage/minio_backend.py modified: - backend/services/storage.py - backend/tests/test_storage.py - backend/tests/conftest.py key-decisions: - "StorageBackend ABC has exactly 5 abstract methods: put_object, get_object, delete_object, presigned_get_url, health_check" - "MinIOBackend.put_object takes no filename/original_name parameter — STORE-02 compliance; only extension (from Path.suffix.lower()) passes through" - "get_object uses inner def _fetch() pattern to handle MinIO HTTPResponse.read() + close() + release_conn() inside asyncio.to_thread" - "services/storage.py uses _backend() lazy singleton to avoid creating multiple Minio clients per-request" - "load_settings/save_settings remain sync flat-file in Phase 1; comment documents Phase 2 migration to users.ai_provider/ai_model" - "test_object_key_schema: db_session parameter removed (test does not need DB; SQLite cannot render PostgreSQL INET type during schema creation)" - "conftest.py: filelock patching removed (isolated_data_dir fixture now only patches SETTINGS_FILE for sync settings functions)" # Metrics duration: "~5 minutes" completed: "2026-05-22" --- # Phase 1 Plan 04: StorageBackend ABC + MinIO Backend + Async Services Rewrite — Summary **MinIO storage abstraction layer built (StorageBackend ABC + MinIOBackend + get_storage_backend() factory mirroring ai/ provider pattern); services/storage.py fully rewritten as async SQLAlchemy + MinIO orchestrator with filelock removed; all 6 Plan 02 test_storage.py xfail scaffolds flipped to PASSED.** ## Performance - **Duration:** ~5 minutes - **Started:** 2026-05-22T07:34:40Z - **Completed:** 2026-05-22T07:39:42Z - **Tasks:** 2 of 2 complete - **Files created:** 3 - **Files modified:** 3 ## Accomplishments ### Task 1: StorageBackend ABC + MinIOBackend + factory Created the `backend/storage/` module mirroring the `backend/ai/` provider pattern: - `backend/storage/base.py` — `StorageBackend(ABC)` with 5 `@abstractmethod async def` methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check` - `backend/storage/minio_backend.py` — `MinIOBackend(StorageBackend)` wrapping every synchronous Minio SDK call in `asyncio.to_thread(self._client., ...)`: - `put_object`: generates STORE-02 key `{user_id}/{document_id}/{uuid4()}{ext}`, does NOT accept filename - `get_object`: inner `_fetch()` helper handles MinIO's `HTTPResponse.read()` + `close()` + `release_conn()` - `delete_object`: wraps `remove_object` - `presigned_get_url`: wraps `presigned_get_object` with `timedelta(minutes=...)` - `health_check`: wraps `bucket_exists` in try/except, returns bool - `backend/storage/__init__.py` — `get_storage_backend()` factory returning `MinIOBackend(endpoint=settings.minio_endpoint, ...)` with `secure=False` for Docker internal HTTP ### Task 2: Async services/storage.py rewrite Rewrote `backend/services/storage.py` from 188 lines (flat-file + filelock) to 473 lines (async SQLAlchemy + MinIO): #### Function signature changes (old → new) | Function | Old signature | New signature | |----------|--------------|---------------| | `save_upload` | `(file_bytes, original_name, mime_type) -> dict` | `async (session, file_bytes, original_name, mime_type) -> dict` | | `save_metadata` | `(meta: dict) -> None` | `async (session, meta: dict) -> None` | | `get_metadata` | `(doc_id: str) -> dict\|None` | `async (session, doc_id: str) -> Optional[dict]` | | `list_metadata` | `(topic=None) -> list[dict]` | `async (session, topic=None) -> list` | | `delete_document` | `(doc_id: str) -> bool` | `async (session, doc_id: str) -> bool` | | `update_document_topics` | `(doc_id, topics) -> dict\|None` | `async (session, doc_id, topics) -> Optional[dict]` | | `remove_topic_from_all_documents` | `(topic_name) -> int` | `async (session, topic_name) -> int` | | `load_topics` | `() -> list[dict]` | `async (session) -> list` | | `save_topics` | `(topics) -> None` | `async (session, topics) -> None` | | `get_topic` | `(topic_id) -> dict\|None` | `async (session, topic_id) -> Optional[dict]` | | `create_topic` | `(name, description, color) -> dict` | `async (session, name, description, color) -> dict` | | `update_topic` | `(topic_id, **kwargs) -> dict\|None` | `async (session, topic_id, name=None, description=None, color=None) -> Optional[dict]` | | `delete_topic` | `(topic_id) -> str\|None` | `async (session, topic_id) -> Optional[str]` | | `topic_doc_counts` | `() -> dict[str, int]` | `async (session) -> dict` | | `load_settings` | `() -> dict` | `def () -> dict` (unchanged — still sync flat-file) | | `save_settings` | `(settings) -> None` | `def (settings) -> None` (unchanged — still sync flat-file) | | `mask_api_key` | `(key) -> str` | `def (key) -> str` (unchanged) | | `settings_masked` | `(settings) -> dict` | `def (settings) -> dict` (unchanged) | #### Key helpers introduced - `_backend()` — lazy singleton returning the `StorageBackend` instance (created once per process) - `_doc_to_dict(doc, topic_names)` — converts a `Document` ORM row + topic names list to the legacy response dict shape - `_load_topic_names(session, doc_id)` — async helper loading topic names for a given document UUID #### Sample object key for an upload ``` null-user/550e8400-e29b-41d4-a716-446655440000/f47ac10b-58cc-4372-a567-0e02b2c3d479.pdf ^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ user_id document_id (UUID) freshly-generated uuid4() + extension (Phase 1 D-03 sentinel) ``` ## Task Commits 1. **Task 1: StorageBackend ABC + MinIOBackend + factory** — `eaf86a8` (feat) 2. **Task 2: Async services/storage.py rewrite** — `3e4b1f1` (feat) ## Files Created/Modified | File | Type | Description | |------|------|-------------| | `backend/storage/base.py` | Created | StorageBackend ABC — 5 abstract async methods | | `backend/storage/minio_backend.py` | Created | MinIOBackend — asyncio.to_thread wrapping; STORE-02 key schema | | `backend/storage/__init__.py` | Created | get_storage_backend() factory | | `backend/services/storage.py` | Rewritten | Async SQLAlchemy + MinIO orchestrator (473 lines) | | `backend/tests/test_storage.py` | Modified | Removed xfail markers; fixed test_object_key_schema (removed unused db_session param) | | `backend/tests/conftest.py` | Modified | Removed filelock patching; now only patches SETTINGS_FILE | ## Deviations from Plan ### Auto-fixed Issues **1. [Rule 1 - Bug] Fixed test_object_key_schema: removed unused db_session parameter** - **Found during:** Task 1 test run - **Issue:** `test_object_key_schema(db_session)` had a `db_session` parameter it never used. The `db_session` fixture tries to create the full PostgreSQL schema (including `audit_log.ip_address INET`) in SQLite, which fails because SQLite has no `INET` type. - **Fix:** Removed the `db_session` parameter from `test_object_key_schema` signature — the test body only uses `MinIOBackend.__new__` with mocks, not DB at all - **Files modified:** `backend/tests/test_storage.py` - **Commit:** `3e4b1f1` **2. [Rule 3 - Blocker] Updated conftest.py to remove filelock patching** - **Found during:** Task 2 test run - **Issue:** conftest.py `isolated_data_dir` fixture imported `from filelock import FileLock` and patched `st._topics_lock`, `st._settings_lock`, `st.UPLOADS_DIR`, `st.METADATA_DIR`, `st.TOPICS_FILE` on the storage module. After the rewrite, none of these attributes exist on `services.storage`. - **Fix:** Replaced the 6 filelock `monkeypatch.setattr` calls with a single `monkeypatch.setattr(st, "SETTINGS_FILE", ...)` (the only flat-file attribute still on the module) - **Files modified:** `backend/tests/conftest.py` - **Commit:** `3e4b1f1` ## Verification Results ``` tests/test_storage.py::test_object_key_schema PASSED tests/test_storage.py::test_filename_not_in_object_key PASSED tests/test_storage.py::test_storage_backend_abc_methods PASSED tests/test_storage.py::test_get_storage_backend_returns_minio PASSED tests/test_storage.py::test_put_object_uses_asyncio_to_thread PASSED tests/test_storage.py::test_minio_backend_health_check_returns_bool PASSED 6 passed ``` ## Known Stubs - `save_upload` uses `user_id="null-user"` as D-03 sentinel — Phase 2 will replace with `str(current_user.id)` after auth lands - `load_settings` / `save_settings` remain flat-file backed — Phase 2 migrates to `users.ai_provider` / `users.ai_model` DB columns - `Document.user_id` is `None` for all Phase 1 uploads — Phase 2 migration adds NOT NULL constraint ## Threat Flags None. All surfaces from the plan's threat model are correctly handled: - T-01-04-01 (object key prediction): key constructed server-side as `{user_id}/{document_id}/{uuid4()}{ext}`; no caller can inject a key - T-01-04-02 (filename leaking): `MinIOBackend.put_object` has no `filename` parameter; tests `test_filename_not_in_object_key` + `test_object_key_schema` enforce this on every CI run - T-01-04-04 (SQL injection): all queries via `select()`, `delete()`, `session.execute()` with parameterized values; no f-strings in SQL ## Next Phase Readiness Plan 04 delivers the storage abstraction and async service layer. Plan 05 can now: - Import `save_upload`, `get_metadata`, `list_metadata`, `delete_document` etc. from `services.storage` with async signatures - Wire `api/documents.py` and `api/topics.py` callers from sync → `async def + await + session` parameter - Wire the MinIO bucket initialization into the FastAPI lifespan - Replace `BackgroundTasks` with Celery tasks using the async service layer **ROADMAP Phase 1 success criterion #3** (MinIO object key schema enforced in model layer) is now met. --- *Phase: 01-infrastructure-foundation* *Completed: 2026-05-22*