Research, pattern mapping, and verification complete. Walking Skeleton mode active (MVP Phase 1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
37 KiB
phase, plan, type, wave, depends_on, files_modified, autonomous, requirements, user_setup, tags, must_haves
| phase | plan | type | wave | depends_on | files_modified | autonomous | requirements | user_setup | tags | must_haves | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-infrastructure-foundation | 04 | execute | 3 |
|
|
true |
|
|
|
Purpose: This is the heart of the walking skeleton. Without these modules, no document bytes can land in MinIO and no metadata can land in PostgreSQL. Plan 05 will compose this layer into the API + lifespan + Celery flow.
Output: Three new files under backend/storage/, one fully rewritten backend/services/storage.py, and all six Plan 02 tests/test_storage.py scaffolds flipping to PASSED.
<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>
@CLAUDE.md @.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/01-infrastructure-foundation/01-CONTEXT.md @.planning/phases/01-infrastructure-foundation/01-RESEARCH.md @.planning/phases/01-infrastructure-foundation/01-PATTERNS.md @.planning/phases/01-infrastructure-foundation/SKELETON.md @.planning/phases/01-infrastructure-foundation/01-03-SUMMARY.md @backend/ai/base.py @backend/ai/__init__.py @backend/ai/openai_provider.py Project's established ABC + factory pattern (from `backend/ai/base.py` + `backend/ai/__init__.py`, lines 1-36 of each — read both files before implementing):# backend/ai/base.py — pattern to mirror
class AIProvider(ABC):
@abstractmethod
async def classify(...) -> ClassificationResult: ...
@abstractmethod
async def health_check(self) -> bool: ...
# backend/ai/__init__.py — factory pattern to mirror
def get_provider(settings: dict) -> AIProvider:
...
match active:
case "openai":
return OpenAIProvider(...)
case _:
raise ValueError(...)
Existing services/storage.py public surface that api/documents.py and api/topics.py and services/classifier.py currently call (sync):
save_upload(file_bytes, original_name, mime_type) -> dict(returns{"id", "filename", "path"})save_metadata(meta: dict) -> Noneget_metadata(doc_id: str) -> dict | Nonelist_metadata(topic: str | None = None) -> list[dict]delete_document(doc_id: str) -> boolupdate_document_topics(doc_id: str, topics: list[str]) -> dict | Noneremove_topic_from_all_documents(topic_name: str) -> intload_topics() -> list[dict]save_topics(topics: list[dict]) -> Noneget_topic(topic_id: str) -> dict | Nonecreate_topic(name: str, description: str = "", color: str = "#6366f1") -> dictupdate_topic(topic_id: str, **kwargs) -> dict | Nonedelete_topic(topic_id: str) -> str | None(returns the deleted topic name)topic_doc_counts() -> dict[str, int]load_settings() -> dictsave_settings(settings: dict) -> Nonemask_api_key(key: str) -> strsettings_masked(settings: dict) -> dict
New services/storage.py MUST keep all of those names but accept an AsyncSession parameter where DB access is needed, except mask_api_key and settings_masked (pure functions — keep sync).
Schema fields the service layer reads/writes (declared in backend/db/models.py from Plan 03):
Document.id, user_id, folder_id, filename, object_key, content_type, size_bytes, storage_backend, extracted_text, status, created_at, updated_atTopic.id, user_id (NULLABLE — Phase 1 has no users), name, description, colorDocumentTopic.document_id, topic_id(composite PK association table)
Settings persistence note: Phase 1 keeps services/storage.load_settings() / save_settings() reading the flat file SETTINGS_FILE (/app/data/settings.json) because the users.ai_provider / users.ai_model schema cannot be populated until Phase 2. Document this with a # Phase 2 will migrate this to DB-backed per-user settings (D-03 deferred to user-scoped column population) comment.
Create `backend/storage/base.py` with the exact contents from RESEARCH.md Pattern 8 (lines 612-640): `from abc import ABC, abstractmethod`, `class StorageBackend(ABC):` with five `@abstractmethod async def` methods — `put_object(self, user_id: str, document_id: str, file_bytes: bytes, extension: str, content_type: str) -> str`, `get_object(self, object_key: str) -> bytes`, `delete_object(self, object_key: str) -> None`, `presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str`, `health_check(self) -> bool`. Each abstract method has a one-line docstring per RESEARCH.md.
Create `backend/storage/minio_backend.py` implementing `MinIOBackend(StorageBackend)`. Imports: `import asyncio`, `import io`, `import uuid`, `from datetime import timedelta`, `from minio import Minio`, `from storage.base import StorageBackend`. Constructor `def __init__(self, endpoint: str, access_key: str, secret_key: str, bucket: str, secure: bool = False)` stores `self._bucket = bucket` and `self._client = Minio(endpoint=endpoint, access_key=access_key, secret_key=secret_key, secure=secure)`.
Implement `async def put_object(self, user_id: str, document_id: str, file_bytes: bytes, extension: str, content_type: str) -> str`: construct `object_key = f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"`. Build `data = io.BytesIO(file_bytes)` (the constructor sets pointer at 0 — no `seek(0)` needed per Pitfall 3, but add an explicit `data.seek(0)` immediately afterward as a belt-and-braces safety). Call `await asyncio.to_thread(self._client.put_object, self._bucket, object_key, data, length=len(file_bytes), content_type=content_type)`. Return `object_key`. IMPORTANT: do NOT accept any `filename`/`original_name` parameter — STORE-02 requires the human filename never reaches this method.
Implement `async def get_object(self, object_key: str) -> bytes`: define a sync helper `def _fetch():` inside the method that calls `response = self._client.get_object(self._bucket, object_key)` then `try: return response.read() finally: response.close(); response.release_conn()`. Call `return await asyncio.to_thread(_fetch)`. This pattern correctly handles MinIO's HTTPResponse-returning `get_object`.
Implement `async def delete_object(self, object_key: str) -> None`: `await asyncio.to_thread(self._client.remove_object, self._bucket, object_key)`.
Implement `async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str`: `return await asyncio.to_thread(self._client.presigned_get_object, self._bucket, object_key, timedelta(minutes=expires_minutes))`.
Implement `async def health_check(self) -> bool`: wrap `await asyncio.to_thread(self._client.bucket_exists, self._bucket)` in a `try/except Exception: return False`; the truthy path returns the exact boolean from the SDK.
Finally, after writing the implementation, edit the six tests in `backend/tests/test_storage.py` to REMOVE the `@pytest.mark.xfail(strict=False, reason="implemented in plan 04")` markers (the tests should now PASS unmarked) — but keep the existing `try/except ImportError: pytest.skip(...)` blocks because they harmlessly fall through when imports succeed.
cd /Users/nik/Documents/Progamming/document_scanner/backend && python3 -m pytest tests/test_storage.py -v 2>&1 | tail -20 && python3 -c "
from storage.base import StorageBackend
from storage import get_storage_backend
from storage.minio_backend import MinIOBackend
assert isinstance(get_storage_backend(), MinIOBackend)
abstract = {m for m in StorageBackend.__abstractmethods__}
assert abstract == {'put_object','get_object','delete_object','presigned_get_url','health_check'}, f'abstract mismatch: {abstract}'
print('storage-abc-ok')
"
- `backend/storage/base.py` exists and contains `class StorageBackend(ABC):`
- `backend/storage/base.py` declares exactly 5 `@abstractmethod` methods (verifiable via `grep -c "@abstractmethod" backend/storage/base.py >= 5`)
- The 5 abstract method names are `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check` (verified by the inline `__abstractmethods__` assertion in the Verify command)
- `backend/storage/__init__.py` contains `def get_storage_backend() -> StorageBackend:` and returns a `MinIOBackend(endpoint=settings.minio_endpoint, ...)`
- `backend/storage/minio_backend.py` contains `class MinIOBackend(StorageBackend):`
- `backend/storage/minio_backend.py` calls `asyncio.to_thread(self._client.put_object`, `asyncio.to_thread(self._client.remove_object`, `asyncio.to_thread(self._client.presigned_get_object`, `asyncio.to_thread(self._client.bucket_exists` (each verifiable via grep — each must appear at least once)
- `backend/storage/minio_backend.py` contains the literal substring `f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"` (the STORE-02 key format)
- `backend/storage/minio_backend.py` does NOT take any `filename` or `original_name` parameter on `put_object` (verifiable: `grep -E "def put_object.*filename|def put_object.*original_name" backend/storage/minio_backend.py` exits with no matches)
- `cd backend && python3 -m pytest tests/test_storage.py -v` exits 0 with all 6 tests reporting PASSED (no XFAIL, no SKIPPED, no FAILED) — verifiable via `python3 -m pytest tests/test_storage.py -v 2>&1 | grep -c "PASSED" >= 6`
- The `@pytest.mark.xfail` markers in `backend/tests/test_storage.py` are removed (verifiable: `grep -c "@pytest.mark.xfail" backend/tests/test_storage.py | grep -q "^0$"`)
The `storage/` module mirrors the `ai/` provider pattern exactly; `MinIOBackend` correctly wraps every sync SDK call in `asyncio.to_thread`; STORE-02 object key format is enforced in code; all six Plan 02 `test_storage.py` scaffolds pass.
Task 2: Rewrite backend/services/storage.py — async SQLAlchemy + MinIO orchestrator
backend/services/storage.py
- `save_upload(session, file_bytes, original_name, mime_type) -> dict`: creates a `Document` row with a freshly generated UUID `id`, `filename=original_name`, `content_type=mime_type`, `size_bytes=len(file_bytes)`, `user_id=None` (D-03 — no auth in Phase 1), `storage_backend="minio"`, `status="pending"`; calls `await get_storage_backend().put_object(user_id=str(user_id or "null-user"), document_id=str(doc.id), file_bytes=file_bytes, extension=Path(original_name).suffix.lower(), content_type=mime_type)`; stores the returned `object_key` on the `Document.object_key` column; commits; returns a dict shape compatible with the existing API: `{"id": str(doc.id), "filename": original_name, "path": object_key, "object_key": object_key}` (the `path` key is kept for compatibility with the current `api/documents.py` line 33 — it now holds the object_key, not a filesystem path)
- `save_metadata(session, meta: dict) -> None`: idempotent update of a `Document` row from the dict shape used by the legacy `api/documents.py` (`id`, `original_name`, `extracted_text`, `topics`, `classified_at`, etc.); maps `meta["original_name"] → Document.filename`, `meta["mime_type"] → Document.content_type`, `meta["size_bytes"] → Document.size_bytes`, `meta["extracted_text"] → Document.extracted_text`; `topics` list is reconciled into the `DocumentTopic` association table (delete existing rows for the document, insert one per topic name — auto-creating topics that don't yet exist via the existing `create_topic` helper); `classified_at → Document.updated_at`
- `get_metadata(session, doc_id: str) -> dict | None`: returns `None` for not-found; otherwise returns a dict matching the response shape `api/documents.py` produces today: `{"id", "original_name", "filename", "mime_type", "size_bytes", "extracted_text", "topics" (list of topic name strings), "created_at", "classified_at"}`
- `list_metadata(session, topic: str | None = None) -> list[dict]`: returns the same dict shape per row; if `topic` is provided, filters to documents joined to a `Topic` row whose `name == topic`
- `delete_document(session, doc_id: str) -> bool`: returns `False` if the document does not exist; otherwise calls `await get_storage_backend().delete_object(doc.object_key)`, deletes the `Document` row (cascade deletes `DocumentTopic` rows via the ORM relationship `ondelete="CASCADE"`), commits, returns `True`
- `update_document_topics(session, doc_id, topics) -> dict | None`: replaces the association rows and returns the refreshed metadata dict (or `None` for not-found)
- `remove_topic_from_all_documents(session, topic_name) -> int`: deletes all `DocumentTopic` rows whose `topic_id` matches the topic with that name; returns the count
- Topic functions (`load_topics`, `create_topic`, `update_topic`, `delete_topic`, `get_topic`, `topic_doc_counts`) are async and query the `Topic` and `DocumentTopic` tables; `create_topic` auto-deduplicates by lowercased name (preserving the existing behavior); response shape matches the existing JSON: `{"id", "name", "description", "color"}` per topic; topic `id` is the UUID stringified
- Settings functions (`load_settings`, `save_settings`, `mask_api_key`, `settings_masked`) continue to read/write the flat `SETTINGS_FILE` JSON in Phase 1 — these will move to the `users` table in Phase 2; document this with a comment
- NO `filelock` import remains in this module
- NO `json.loads` / `json.dumps` / `Path.read_text` / `Path.write_text` calls touch document or topic data (only the still-flat `SETTINGS_FILE` may keep using JSON)
- backend/services/storage.py (current 188-line file — read in full; understand every function signature and return shape so the rewrite preserves the contract)
- backend/db/models.py (Plan 03 output — exact column names and relationships)
- backend/api/documents.py (read in full — note exactly which `storage.*` calls are made and what return-shape fields are consumed)
- backend/api/topics.py (read in full — same scrutiny for topic-related calls)
- backend/services/classifier.py (read in full — uses `storage.get_metadata`, `storage.load_settings`, `storage.load_topics`, `storage.create_topic`, `storage.update_document_topics`)
- .planning/phases/01-infrastructure-foundation/01-PATTERNS.md (backend/services/storage.py section — explains the contract preservation strategy)
- .planning/phases/01-infrastructure-foundation/01-RESEARCH.md (Pitfall 1 — expire_on_commit=False; Pattern 1 — session lifecycle)
- .planning/phases/01-infrastructure-foundation/01-CONTEXT.md (D-04 data dir deleted, D-05 storage layer switched, D-06 object key schema)
Replace `backend/services/storage.py` entirely. New imports: `import uuid`, `import json`, `from datetime import datetime, timezone`, `from pathlib import Path`, `from sqlalchemy import select, func as sql_func, delete`, `from sqlalchemy.ext.asyncio import AsyncSession`, `from sqlalchemy.orm import selectinload`, `from db.models import Document, Topic, DocumentTopic`, `from storage import get_storage_backend`, `from config import settings as cfg_settings, DEFAULT_SETTINGS, SETTINGS_FILE`. Do NOT import `filelock` — that dependency is removed.
Add a module-level `_storage = None` and a `def _backend(): global _storage; _storage = _storage or get_storage_backend(); return _storage` helper so the MinIO backend is lazily instantiated once per process (matching the singleton-like behavior of the previous module-level filelock objects).
Implement every function below as `async def` (except `mask_api_key`, `settings_masked` which stay sync). The session is the first positional parameter for every DB-touching function.
1. `async def save_upload(session: AsyncSession, file_bytes: bytes, original_name: str, mime_type: str) -> dict`: generate `doc_id = uuid.uuid4()`. Extract `suffix = Path(original_name).suffix.lower()`. Build `doc = Document(id=doc_id, user_id=None, filename=original_name, content_type=mime_type, size_bytes=len(file_bytes), storage_backend="minio", status="pending", object_key="")`. `session.add(doc)`. `await session.flush()` (to materialize `doc.id` without committing). Call `object_key = await _backend().put_object(user_id="null-user", document_id=str(doc_id), file_bytes=file_bytes, extension=suffix, content_type=mime_type)` — the literal sentinel `"null-user"` is used because there is no user in Phase 1 (D-03); Phase 2 will replace this with `str(current_user.id)`. Set `doc.object_key = object_key`. `await session.commit()`. Return `{"id": str(doc_id), "filename": original_name, "path": object_key, "object_key": object_key, "user_id": None}`.
2. `async def save_metadata(session: AsyncSession, meta: dict) -> None`: look up `doc = await session.get(Document, uuid.UUID(meta["id"]))`. If not found, return. Set `doc.extracted_text = meta.get("extracted_text", "")`. If `meta.get("topics")` is a non-empty list of strings, call `await update_document_topics(session, meta["id"], meta["topics"])` (which handles association table reconciliation). Set `doc.status = "classified" if meta.get("classified_at") else "pending"`. `await session.commit()`.
3. `async def get_metadata(session: AsyncSession, doc_id: str) -> dict | None`: try `uid = uuid.UUID(doc_id)` inside try/except `ValueError: return None` (keep the legacy string-doc-id tolerance from `api/documents.py`). `doc = await session.get(Document, uid)`. If `None`, return `None`. Load topics: `topics_q = await session.execute(select(Topic.name).join(DocumentTopic, DocumentTopic.topic_id == Topic.id).where(DocumentTopic.document_id == uid))`; `topic_names = [r[0] for r in topics_q]`. Return `{"id": str(doc.id), "original_name": doc.filename, "filename": doc.filename, "mime_type": doc.content_type, "size_bytes": doc.size_bytes, "extracted_text": doc.extracted_text or "", "topics": topic_names, "created_at": doc.created_at.isoformat() if doc.created_at else None, "classified_at": doc.updated_at.isoformat() if doc.status == "classified" else None}`.
4. `async def list_metadata(session: AsyncSession, topic: str | None = None) -> list[dict]`: base query `stmt = select(Document).order_by(Document.created_at.desc())`. If `topic` is provided, join `DocumentTopic` and `Topic` and filter `Topic.name == topic`. Execute, then for each `Document` call the same dict-build code as `get_metadata` (factor it out into a private `def _doc_to_dict(doc, topic_names)` helper).
5. `async def delete_document(session: AsyncSession, doc_id: str) -> bool`: lookup as above; if not found, return `False`. Call `await _backend().delete_object(doc.object_key)` inside try/except (log to stderr on failure but still proceed with DB delete — the bytes may already be gone). `await session.delete(doc)`. `await session.commit()`. Return `True`.
6. `async def update_document_topics(session: AsyncSession, doc_id: str, topics: list[str]) -> dict | None`: lookup the doc; return `None` if missing. Delete all existing `DocumentTopic` rows for this document via `await session.execute(delete(DocumentTopic).where(DocumentTopic.document_id == uid))`. For each name in `topics` (deduped), call `topic_dict = await create_topic(session, name)` (auto-create if missing) and insert a `DocumentTopic(document_id=uid, topic_id=uuid.UUID(topic_dict["id"]))`. Set `doc.status = "classified"`. `await session.commit()`. Return `await get_metadata(session, doc_id)`.
7. `async def remove_topic_from_all_documents(session: AsyncSession, topic_name: str) -> int`: find the topic by name; if missing, return 0. Run `result = await session.execute(delete(DocumentTopic).where(DocumentTopic.topic_id == topic.id))`. `await session.commit()`. Return `result.rowcount`.
8. `async def load_topics(session: AsyncSession) -> list[dict]`: `q = await session.execute(select(Topic).order_by(Topic.name))`. Return `[{"id": str(t.id), "name": t.name, "description": t.description, "color": t.color} for t in q.scalars()]`.
9. `async def save_topics(session: AsyncSession, topics: list[dict]) -> None`: idempotent bulk replace — delete all `Topic` rows, then insert the provided list. (Phase 1 use: not directly called by any current endpoint; preserved for legacy compatibility — add a `# legacy: not used by current endpoints` comment.)
10. `async def get_topic(session: AsyncSession, topic_id: str) -> dict | None`: `t = await session.get(Topic, uuid.UUID(topic_id))`; return `None` or `{"id": str(t.id), ...}`.
11. `async def create_topic(session: AsyncSession, name: str, description: str = "", color: str = "#6366f1") -> dict`: case-insensitive deduplication — `q = await session.execute(select(Topic).where(sql_func.lower(Topic.name) == name.lower()))`. If a row exists, return its dict shape. Otherwise insert a new `Topic(name=name, description=description, color=color)`, commit, return dict.
12. `async def update_topic(session: AsyncSession, topic_id: str, name=None, description=None, color=None) -> dict | None`: lookup; if missing, `None`. Update only non-None fields. Commit. Return dict.
13. `async def delete_topic(session: AsyncSession, topic_id: str) -> str | None`: lookup; if missing, `None`. Save the name. Delete the Topic (cascade removes DocumentTopic rows via `ondelete="CASCADE"`). Commit. Return the saved name.
14. `async def topic_doc_counts(session: AsyncSession) -> dict[str, int]`: `q = await session.execute(select(Topic.name, sql_func.count(DocumentTopic.document_id)).join(DocumentTopic, DocumentTopic.topic_id == Topic.id, isouter=True).group_by(Topic.name))`. Return `{name: count for name, count in q}`.
15. `def load_settings() -> dict`: keep sync; read `SETTINGS_FILE` (still flat JSON) — if file missing, return a deep copy of `DEFAULT_SETTINGS`. Add comment `# Phase 2 will move per-user settings to users.ai_provider / users.ai_model`.
16. `def save_settings(settings: dict) -> None`: keep sync; write `SETTINGS_FILE`. No filelock — Phase 1 settings file is single-writer.
17. `def mask_api_key(key: str) -> str` and `def settings_masked(settings: dict) -> dict`: keep verbatim from current file (pure functions).
At the bottom of the file, expose every function via explicit `__all__ = ["save_upload", "save_metadata", "get_metadata", "list_metadata", "delete_document", "update_document_topics", "remove_topic_from_all_documents", "load_topics", "save_topics", "get_topic", "create_topic", "update_topic", "delete_topic", "topic_doc_counts", "load_settings", "save_settings", "mask_api_key", "settings_masked"]` so importers see the public surface.
cd /Users/nik/Documents/Progamming/document_scanner/backend && python3 -c "
import inspect
import services.storage as st
async_funcs = ['save_upload','save_metadata','get_metadata','list_metadata','delete_document','update_document_topics','remove_topic_from_all_documents','load_topics','save_topics','get_topic','create_topic','update_topic','delete_topic','topic_doc_counts']
sync_funcs = ['load_settings','save_settings','mask_api_key','settings_masked']
for f in async_funcs:
assert inspect.iscoroutinefunction(getattr(st, f)), f'{f} should be async'
for f in sync_funcs:
fn = getattr(st, f)
assert not inspect.iscoroutinefunction(fn), f'{f} should be sync'
assert 'filelock' not in open(st.__file__).read(), 'filelock import still present'
assert 'FileLock' not in open(st.__file__).read(), 'FileLock import still present'
print('services-storage-rewrite-ok')
"
- `backend/services/storage.py` no longer imports `filelock` or `FileLock` (verifiable: `grep -cE "from filelock|import filelock|FileLock" backend/services/storage.py | grep -q "^0$"`)
- `backend/services/storage.py` imports `from sqlalchemy.ext.asyncio import AsyncSession` and `from db.models import Document, Topic, DocumentTopic`
- `backend/services/storage.py` imports `from storage import get_storage_backend`
- All 14 DB-touching functions are `async def` (verified by the inline `inspect.iscoroutinefunction` assertions in the Verify command)
- All 4 pure functions (`load_settings`, `save_settings`, `mask_api_key`, `settings_masked`) remain sync
- `save_upload` constructs an object_key via the MinIO backend (verifiable: `grep -c "await.*put_object" backend/services/storage.py >= 1`)
- `save_upload` passes `user_id="null-user"` (D-03 sentinel) — verifiable via `grep -c 'user_id="null-user"\|user_id=.null-user.' backend/services/storage.py >= 1`
- The file declares `__all__ = [...]` listing all 18 public functions (verifiable: `grep -c "__all__" backend/services/storage.py >= 1`)
- The file has at least 120 lines (the `must_haves.artifacts.min_lines` target — verifiable: `wc -l backend/services/storage.py | awk '{exit ($1 >= 120) ? 0 : 1}'`)
- No `Path(...).read_text()` or `Path(...).write_text()` call references `METADATA_DIR`, `UPLOADS_DIR`, `TOPICS_FILE` (verifiable: `grep -cE "METADATA_DIR|UPLOADS_DIR|TOPICS_FILE" backend/services/storage.py | grep -q "^0$"`)
- `SETTINGS_FILE` references remain (for `load_settings`/`save_settings` — verifiable: `grep -c "SETTINGS_FILE" backend/services/storage.py >= 1`)
- `cd backend && python3 -c "import services.storage; print('import-ok')"` exits 0 with `import-ok`
- All previous Plan 02 `test_storage.py` tests still PASSED (the rewrite must not have removed the `MinIOBackend` exports or the factory) — `cd backend && python3 -m pytest tests/test_storage.py -v` exits 0 with 6 PASSED
`backend/services/storage.py` is fully async, all DB access goes through SQLAlchemy ORM, all object storage goes through the `MinIOBackend` via `get_storage_backend()`, no flat-file or filelock code remains for documents or topics. The public function names are preserved so Plan 05 can wire callers with `def → async def + await` plus a `session` parameter.
<threat_model>
Trust Boundaries
| Boundary | Description |
|---|---|
services/storage.py → storage/minio_backend.py |
Internal Python call; bytes never cross a network boundary inside this module |
storage/minio_backend.py → MinIO server |
App connects with MINIO_ACCESS_KEY (app-level credentials, not root) over Docker internal HTTP |
services/storage.py → PostgreSQL via AsyncSession |
All queries parameterized via SQLAlchemy ORM (SEC-03) |
STRIDE Threat Register
| Threat ID | Category | Component | Disposition | Mitigation Plan |
|---|---|---|---|---|
| T-01-04-01 | Tampering | Object key prediction / path traversal | mitigate | MinIOBackend.put_object constructs keys server-side as {user_id}/{document_id}/{uuid4()}{ext} (STORE-02, D-06); no caller can inject a key. The extension is derived from Path(original_name).suffix.lower() — the only piece of user input — and is concatenated as a suffix to a freshly generated UUID, so a malicious extension cannot escape the prefix (the leading {user_id}/{document_id}/{uuid4} portion is always server-controlled). Wave 0 tests test_object_key_schema + test_filename_not_in_object_key enforce this on every CI run. |
| T-01-04-02 | Information Disclosure | Human filename leaking into the object store namespace | mitigate | MinIOBackend.put_object does NOT accept a filename parameter (only extension). The human filename is stored only in the documents.filename DB column (D-06). Test test_filename_not_in_object_key asserts this on every run. |
| T-01-04-03 | Information Disclosure | Extension as a covert channel (e.g., .html for XSS via download) |
accept | Phase 1 has no download flow exposed to the browser (no presigned URL endpoint, no proxied download endpoint). Phase 4 (DOC-02) will introduce PDF-only proxied preview and content-type validation. Documented in SKELETON.md. |
| T-01-04-04 | Tampering | SQL injection via ORM string concatenation | mitigate | All queries use select(...), delete(...), session.execute(...) with parameterized values — never f-strings or .format() for SQL. The only string interpolation is the MinIO object key, which is not SQL. SEC-03 / CLAUDE.md. |
| T-01-04-05 | Denial of Service | Unbounded list_metadata returning all documents | accept | Phase 1 has only the single test user's documents (≤ 100 MB of uploads from D-04 reset state). Phase 3 (STORE-03) introduces pagination + quota enforcement. The existing pagination in api/documents.py (page, per_page) remains in effect — Plan 05 will wire it. |
| T-01-04-SC | Tampering | npm/pip/cargo installs | N/A | No new installs in this plan; uses minio/sqlalchemy from Plan 01. RESEARCH.md Package Legitimacy Audit covers both (slopcheck OK). |
| </threat_model> |
<success_criteria>
backend/storage/base.py,backend/storage/__init__.py,backend/storage/minio_backend.pyexist; the ABC + factory pattern mirrorsbackend/ai/exactly.backend/services/storage.pyis rewritten with async signatures, ORM access, and the MinIO backend integration;filelockis gone from this module.- All six Plan 02
tests/test_storage.pytests pass (xfail markers removed). - The legacy
tests/test_documents.pysync tests continue to pass — the cutover happens in Plan 05; this plan is additive on the storage layer side. </success_criteria>