Files

208 lines
11 KiB
Markdown

---
phase: 01-infrastructure-foundation
plan: 04
subsystem: storage
tags:
- storage
- minio
- sqlalchemy
- service-layer
- abc
# Dependency graph
requires:
- phase: 01-01
provides: "Pydantic Settings (config.py) with minio_endpoint / minio_access_key / minio_secret_key / minio_bucket"
- phase: 01-02
provides: "Wave 0 test scaffolds — test_storage.py with 6 xfail tests this plan unblocks"
- phase: 01-03
provides: "backend/db/models.py (Document, Topic, DocumentTopic ORM models), backend/db/session.py (AsyncSessionLocal)"
provides:
- "backend/storage/base.py — StorageBackend ABC with 5 abstract methods"
- "backend/storage/minio_backend.py — MinIOBackend wrapping sync Minio SDK in asyncio.to_thread()"
- "backend/storage/__init__.py — get_storage_backend() factory"
- "backend/services/storage.py — async SQLAlchemy + MinIO orchestrator (14 async + 4 sync functions)"
affects:
- 01-05
- "All subsequent plans that call storage service functions"
# Tech tracking
tech-stack:
added:
- "minio SDK (Minio client, put_object, get_object, remove_object, presigned_get_object, bucket_exists)"
- "asyncio.to_thread() pattern for sync-in-async SDK wrapping (RESEARCH.md Pattern 3)"
- "StorageBackend ABC (abc.ABC, @abstractmethod)"
patterns:
- "ABC + factory pattern mirroring backend/ai/base.py + backend/ai/__init__.py"
- "asyncio.to_thread(self._client.<method>, ...) wraps every blocking Minio SDK call"
- "Lazy singleton _backend() function for single-process MinIOBackend instance"
- "STORE-02 key schema: {user_id}/{document_id}/{uuid4()}{ext} — filename never in key"
- "D-03 null-user sentinel: user_id='null-user' in Phase 1 (no auth)"
key-files:
created:
- backend/storage/__init__.py
- backend/storage/base.py
- backend/storage/minio_backend.py
modified:
- backend/services/storage.py
- backend/tests/test_storage.py
- backend/tests/conftest.py
key-decisions:
- "StorageBackend ABC has exactly 5 abstract methods: put_object, get_object, delete_object, presigned_get_url, health_check"
- "MinIOBackend.put_object takes no filename/original_name parameter — STORE-02 compliance; only extension (from Path.suffix.lower()) passes through"
- "get_object uses inner def _fetch() pattern to handle MinIO HTTPResponse.read() + close() + release_conn() inside asyncio.to_thread"
- "services/storage.py uses _backend() lazy singleton to avoid creating multiple Minio clients per-request"
- "load_settings/save_settings remain sync flat-file in Phase 1; comment documents Phase 2 migration to users.ai_provider/ai_model"
- "test_object_key_schema: db_session parameter removed (test does not need DB; SQLite cannot render PostgreSQL INET type during schema creation)"
- "conftest.py: filelock patching removed (isolated_data_dir fixture now only patches SETTINGS_FILE for sync settings functions)"
# Metrics
duration: "~5 minutes"
completed: "2026-05-22"
---
# Phase 1 Plan 04: StorageBackend ABC + MinIO Backend + Async Services Rewrite — Summary
**MinIO storage abstraction layer built (StorageBackend ABC + MinIOBackend + get_storage_backend() factory mirroring ai/ provider pattern); services/storage.py fully rewritten as async SQLAlchemy + MinIO orchestrator with filelock removed; all 6 Plan 02 test_storage.py xfail scaffolds flipped to PASSED.**
## Performance
- **Duration:** ~5 minutes
- **Started:** 2026-05-22T07:34:40Z
- **Completed:** 2026-05-22T07:39:42Z
- **Tasks:** 2 of 2 complete
- **Files created:** 3
- **Files modified:** 3
## Accomplishments
### Task 1: StorageBackend ABC + MinIOBackend + factory
Created the `backend/storage/` module mirroring the `backend/ai/` provider pattern:
- `backend/storage/base.py``StorageBackend(ABC)` with 5 `@abstractmethod async def` methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`
- `backend/storage/minio_backend.py``MinIOBackend(StorageBackend)` wrapping every synchronous Minio SDK call in `asyncio.to_thread(self._client.<method>, ...)`:
- `put_object`: generates STORE-02 key `{user_id}/{document_id}/{uuid4()}{ext}`, does NOT accept filename
- `get_object`: inner `_fetch()` helper handles MinIO's `HTTPResponse.read()` + `close()` + `release_conn()`
- `delete_object`: wraps `remove_object`
- `presigned_get_url`: wraps `presigned_get_object` with `timedelta(minutes=...)`
- `health_check`: wraps `bucket_exists` in try/except, returns bool
- `backend/storage/__init__.py``get_storage_backend()` factory returning `MinIOBackend(endpoint=settings.minio_endpoint, ...)` with `secure=False` for Docker internal HTTP
### Task 2: Async services/storage.py rewrite
Rewrote `backend/services/storage.py` from 188 lines (flat-file + filelock) to 473 lines (async SQLAlchemy + MinIO):
#### Function signature changes (old → new)
| Function | Old signature | New signature |
|----------|--------------|---------------|
| `save_upload` | `(file_bytes, original_name, mime_type) -> dict` | `async (session, file_bytes, original_name, mime_type) -> dict` |
| `save_metadata` | `(meta: dict) -> None` | `async (session, meta: dict) -> None` |
| `get_metadata` | `(doc_id: str) -> dict\|None` | `async (session, doc_id: str) -> Optional[dict]` |
| `list_metadata` | `(topic=None) -> list[dict]` | `async (session, topic=None) -> list` |
| `delete_document` | `(doc_id: str) -> bool` | `async (session, doc_id: str) -> bool` |
| `update_document_topics` | `(doc_id, topics) -> dict\|None` | `async (session, doc_id, topics) -> Optional[dict]` |
| `remove_topic_from_all_documents` | `(topic_name) -> int` | `async (session, topic_name) -> int` |
| `load_topics` | `() -> list[dict]` | `async (session) -> list` |
| `save_topics` | `(topics) -> None` | `async (session, topics) -> None` |
| `get_topic` | `(topic_id) -> dict\|None` | `async (session, topic_id) -> Optional[dict]` |
| `create_topic` | `(name, description, color) -> dict` | `async (session, name, description, color) -> dict` |
| `update_topic` | `(topic_id, **kwargs) -> dict\|None` | `async (session, topic_id, name=None, description=None, color=None) -> Optional[dict]` |
| `delete_topic` | `(topic_id) -> str\|None` | `async (session, topic_id) -> Optional[str]` |
| `topic_doc_counts` | `() -> dict[str, int]` | `async (session) -> dict` |
| `load_settings` | `() -> dict` | `def () -> dict` (unchanged — still sync flat-file) |
| `save_settings` | `(settings) -> None` | `def (settings) -> None` (unchanged — still sync flat-file) |
| `mask_api_key` | `(key) -> str` | `def (key) -> str` (unchanged) |
| `settings_masked` | `(settings) -> dict` | `def (settings) -> dict` (unchanged) |
#### Key helpers introduced
- `_backend()` — lazy singleton returning the `StorageBackend` instance (created once per process)
- `_doc_to_dict(doc, topic_names)` — converts a `Document` ORM row + topic names list to the legacy response dict shape
- `_load_topic_names(session, doc_id)` — async helper loading topic names for a given document UUID
#### Sample object key for an upload
```
null-user/550e8400-e29b-41d4-a716-446655440000/f47ac10b-58cc-4372-a567-0e02b2c3d479.pdf
^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
user_id document_id (UUID) freshly-generated uuid4() + extension
(Phase 1 D-03 sentinel)
```
## Task Commits
1. **Task 1: StorageBackend ABC + MinIOBackend + factory**`eaf86a8` (feat)
2. **Task 2: Async services/storage.py rewrite**`3e4b1f1` (feat)
## Files Created/Modified
| File | Type | Description |
|------|------|-------------|
| `backend/storage/base.py` | Created | StorageBackend ABC — 5 abstract async methods |
| `backend/storage/minio_backend.py` | Created | MinIOBackend — asyncio.to_thread wrapping; STORE-02 key schema |
| `backend/storage/__init__.py` | Created | get_storage_backend() factory |
| `backend/services/storage.py` | Rewritten | Async SQLAlchemy + MinIO orchestrator (473 lines) |
| `backend/tests/test_storage.py` | Modified | Removed xfail markers; fixed test_object_key_schema (removed unused db_session param) |
| `backend/tests/conftest.py` | Modified | Removed filelock patching; now only patches SETTINGS_FILE |
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 1 - Bug] Fixed test_object_key_schema: removed unused db_session parameter**
- **Found during:** Task 1 test run
- **Issue:** `test_object_key_schema(db_session)` had a `db_session` parameter it never used. The `db_session` fixture tries to create the full PostgreSQL schema (including `audit_log.ip_address INET`) in SQLite, which fails because SQLite has no `INET` type.
- **Fix:** Removed the `db_session` parameter from `test_object_key_schema` signature — the test body only uses `MinIOBackend.__new__` with mocks, not DB at all
- **Files modified:** `backend/tests/test_storage.py`
- **Commit:** `3e4b1f1`
**2. [Rule 3 - Blocker] Updated conftest.py to remove filelock patching**
- **Found during:** Task 2 test run
- **Issue:** conftest.py `isolated_data_dir` fixture imported `from filelock import FileLock` and patched `st._topics_lock`, `st._settings_lock`, `st.UPLOADS_DIR`, `st.METADATA_DIR`, `st.TOPICS_FILE` on the storage module. After the rewrite, none of these attributes exist on `services.storage`.
- **Fix:** Replaced the 6 filelock `monkeypatch.setattr` calls with a single `monkeypatch.setattr(st, "SETTINGS_FILE", ...)` (the only flat-file attribute still on the module)
- **Files modified:** `backend/tests/conftest.py`
- **Commit:** `3e4b1f1`
## Verification Results
```
tests/test_storage.py::test_object_key_schema PASSED
tests/test_storage.py::test_filename_not_in_object_key PASSED
tests/test_storage.py::test_storage_backend_abc_methods PASSED
tests/test_storage.py::test_get_storage_backend_returns_minio PASSED
tests/test_storage.py::test_put_object_uses_asyncio_to_thread PASSED
tests/test_storage.py::test_minio_backend_health_check_returns_bool PASSED
6 passed
```
## Known Stubs
- `save_upload` uses `user_id="null-user"` as D-03 sentinel — Phase 2 will replace with `str(current_user.id)` after auth lands
- `load_settings` / `save_settings` remain flat-file backed — Phase 2 migrates to `users.ai_provider` / `users.ai_model` DB columns
- `Document.user_id` is `None` for all Phase 1 uploads — Phase 2 migration adds NOT NULL constraint
## Threat Flags
None. All surfaces from the plan's threat model are correctly handled:
- T-01-04-01 (object key prediction): key constructed server-side as `{user_id}/{document_id}/{uuid4()}{ext}`; no caller can inject a key
- T-01-04-02 (filename leaking): `MinIOBackend.put_object` has no `filename` parameter; tests `test_filename_not_in_object_key` + `test_object_key_schema` enforce this on every CI run
- T-01-04-04 (SQL injection): all queries via `select()`, `delete()`, `session.execute()` with parameterized values; no f-strings in SQL
## Next Phase Readiness
Plan 04 delivers the storage abstraction and async service layer. Plan 05 can now:
- Import `save_upload`, `get_metadata`, `list_metadata`, `delete_document` etc. from `services.storage` with async signatures
- Wire `api/documents.py` and `api/topics.py` callers from sync → `async def + await + session` parameter
- Wire the MinIO bucket initialization into the FastAPI lifespan
- Replace `BackgroundTasks` with Celery tasks using the async service layer
**ROADMAP Phase 1 success criterion #3** (MinIO object key schema enforced in model layer) is now met.
---
*Phase: 01-infrastructure-foundation*
*Completed: 2026-05-22*