6fed5ba531
Research, pattern mapping, and verification complete. Walking Skeleton mode active (MVP Phase 1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
361 lines
37 KiB
Markdown
361 lines
37 KiB
Markdown
---
|
|
phase: 01-infrastructure-foundation
|
|
plan: 04
|
|
type: execute
|
|
wave: 3
|
|
depends_on:
|
|
- 01-03
|
|
files_modified:
|
|
- backend/storage/__init__.py
|
|
- backend/storage/base.py
|
|
- backend/storage/minio_backend.py
|
|
- backend/services/storage.py
|
|
autonomous: true
|
|
requirements:
|
|
- STORE-01
|
|
- STORE-02
|
|
- STORE-07
|
|
user_setup: []
|
|
tags:
|
|
- storage
|
|
- minio
|
|
- sqlalchemy
|
|
- service-layer
|
|
- abc
|
|
|
|
must_haves:
|
|
truths:
|
|
- "`backend/storage/base.py` declares `StorageBackend` ABC with five abstract methods matching the contract in Plan 02's test_storage.py"
|
|
- "`backend/storage/__init__.py` exports a `get_storage_backend()` factory returning a `MinIOBackend` instance configured from `config.settings`"
|
|
- "`backend/storage/minio_backend.py` implements all five `StorageBackend` methods; every call to the synchronous `Minio` SDK is wrapped in `asyncio.to_thread()`"
|
|
- "`MinIOBackend.put_object` produces object keys matching `{user_id}/{document_id}/{uuid4()}{ext}` exactly (D-06, STORE-02)"
|
|
- "The original human filename is never passed into the MinIO SDK and never appears in the returned object key"
|
|
- "`backend/services/storage.py` is entirely rewritten — no `filelock`, no flat-file I/O, no `json.loads/dumps` on metadata; every function is `async def` and accepts an `AsyncSession`"
|
|
- "The new `services/storage.py` preserves the existing function names so `api/documents.py` and `api/topics.py` can adopt the async signatures with minimal code change in Plan 05"
|
|
- "All six Plan 02 `tests/test_storage.py` tests flip from XFAIL to PASSED after this plan ships"
|
|
artifacts:
|
|
- path: "backend/storage/base.py"
|
|
provides: "StorageBackend ABC mirroring backend/ai/base.py"
|
|
contains: "class StorageBackend(ABC)"
|
|
- path: "backend/storage/__init__.py"
|
|
provides: "get_storage_backend() factory mirroring backend/ai/__init__.py"
|
|
contains: "def get_storage_backend"
|
|
- path: "backend/storage/minio_backend.py"
|
|
provides: "MinIO implementation of StorageBackend with asyncio.to_thread wrapping"
|
|
contains: "class MinIOBackend(StorageBackend)"
|
|
min_lines: 80
|
|
- path: "backend/services/storage.py"
|
|
provides: "Async PostgreSQL+MinIO orchestrator replacing the flat-file implementation"
|
|
contains: "from db.models import Document"
|
|
min_lines: 120
|
|
key_links:
|
|
- from: "backend/storage/__init__.py"
|
|
to: "backend/config.settings"
|
|
via: "factory reads minio_endpoint / minio_access_key / minio_secret_key / minio_bucket"
|
|
pattern: "settings\\.minio_(endpoint|access_key|secret_key|bucket)"
|
|
- from: "backend/storage/minio_backend.py"
|
|
to: "asyncio.to_thread"
|
|
via: "every put_object/get_object/delete_object/presigned/bucket_exists call"
|
|
pattern: "asyncio\\.to_thread\\(self\\._client\\."
|
|
- from: "backend/services/storage.py"
|
|
to: "backend/db/models"
|
|
via: "Document, Topic, DocumentTopic ORM queries"
|
|
pattern: "Document|Topic|DocumentTopic"
|
|
- from: "backend/services/storage.py"
|
|
to: "backend/storage.get_storage_backend"
|
|
via: "module-level call to obtain the configured StorageBackend"
|
|
pattern: "get_storage_backend"
|
|
---
|
|
|
|
<objective>
|
|
Build the storage abstraction layer (StorageBackend ABC + MinIO implementation following the established `backend/ai/` provider pattern) AND rewrite `backend/services/storage.py` to use async SQLAlchemy ORM + MinIO instead of flat-file JSON + filelock (D-05). Preserve the existing public function names (`save_upload`, `save_metadata`, `get_metadata`, `list_metadata`, `delete_document`, `update_document_topics`, `load_topics`, `save_topics`, `get_topic`, `create_topic`, `update_topic`, `delete_topic`, `topic_doc_counts`, `load_settings`, `save_settings`, `mask_api_key`, `settings_masked`) but change every function signature to `async def` that accepts an `AsyncSession` as the first parameter where ORM access is required. This plan does NOT yet wire the API routes — Plan 05 swaps callers from sync to async.
|
|
|
|
Purpose: This is the heart of the walking skeleton. Without these modules, no document bytes can land in MinIO and no metadata can land in PostgreSQL. Plan 05 will compose this layer into the API + lifespan + Celery flow.
|
|
|
|
Output: Three new files under `backend/storage/`, one fully rewritten `backend/services/storage.py`, and all six Plan 02 `tests/test_storage.py` scaffolds flipping to PASSED.
|
|
</objective>
|
|
|
|
<execution_context>
|
|
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
|
|
@$HOME/.claude/get-shit-done/templates/summary.md
|
|
</execution_context>
|
|
|
|
<context>
|
|
@CLAUDE.md
|
|
@.planning/PROJECT.md
|
|
@.planning/ROADMAP.md
|
|
@.planning/STATE.md
|
|
@.planning/phases/01-infrastructure-foundation/01-CONTEXT.md
|
|
@.planning/phases/01-infrastructure-foundation/01-RESEARCH.md
|
|
@.planning/phases/01-infrastructure-foundation/01-PATTERNS.md
|
|
@.planning/phases/01-infrastructure-foundation/SKELETON.md
|
|
@.planning/phases/01-infrastructure-foundation/01-03-SUMMARY.md
|
|
@backend/ai/base.py
|
|
@backend/ai/__init__.py
|
|
@backend/ai/openai_provider.py
|
|
|
|
<interfaces>
|
|
Project's established ABC + factory pattern (from `backend/ai/base.py` + `backend/ai/__init__.py`, lines 1-36 of each — read both files before implementing):
|
|
|
|
```python
|
|
# backend/ai/base.py — pattern to mirror
|
|
class AIProvider(ABC):
|
|
@abstractmethod
|
|
async def classify(...) -> ClassificationResult: ...
|
|
@abstractmethod
|
|
async def health_check(self) -> bool: ...
|
|
|
|
# backend/ai/__init__.py — factory pattern to mirror
|
|
def get_provider(settings: dict) -> AIProvider:
|
|
...
|
|
match active:
|
|
case "openai":
|
|
return OpenAIProvider(...)
|
|
case _:
|
|
raise ValueError(...)
|
|
```
|
|
|
|
Existing `services/storage.py` public surface that `api/documents.py` and `api/topics.py` and `services/classifier.py` currently call (sync):
|
|
- `save_upload(file_bytes, original_name, mime_type) -> dict` (returns `{"id", "filename", "path"}`)
|
|
- `save_metadata(meta: dict) -> None`
|
|
- `get_metadata(doc_id: str) -> dict | None`
|
|
- `list_metadata(topic: str | None = None) -> list[dict]`
|
|
- `delete_document(doc_id: str) -> bool`
|
|
- `update_document_topics(doc_id: str, topics: list[str]) -> dict | None`
|
|
- `remove_topic_from_all_documents(topic_name: str) -> int`
|
|
- `load_topics() -> list[dict]`
|
|
- `save_topics(topics: list[dict]) -> None`
|
|
- `get_topic(topic_id: str) -> dict | None`
|
|
- `create_topic(name: str, description: str = "", color: str = "#6366f1") -> dict`
|
|
- `update_topic(topic_id: str, **kwargs) -> dict | None`
|
|
- `delete_topic(topic_id: str) -> str | None` (returns the deleted topic name)
|
|
- `topic_doc_counts() -> dict[str, int]`
|
|
- `load_settings() -> dict`
|
|
- `save_settings(settings: dict) -> None`
|
|
- `mask_api_key(key: str) -> str`
|
|
- `settings_masked(settings: dict) -> dict`
|
|
|
|
New `services/storage.py` MUST keep all of those names but accept an `AsyncSession` parameter where DB access is needed, except `mask_api_key` and `settings_masked` (pure functions — keep sync).
|
|
|
|
Schema fields the service layer reads/writes (declared in `backend/db/models.py` from Plan 03):
|
|
- `Document.id, user_id, folder_id, filename, object_key, content_type, size_bytes, storage_backend, extracted_text, status, created_at, updated_at`
|
|
- `Topic.id, user_id (NULLABLE — Phase 1 has no users), name, description, color`
|
|
- `DocumentTopic.document_id, topic_id` (composite PK association table)
|
|
|
|
Settings persistence note: Phase 1 keeps `services/storage.load_settings()` / `save_settings()` reading the flat file `SETTINGS_FILE` (`/app/data/settings.json`) because the `users.ai_provider` / `users.ai_model` schema cannot be populated until Phase 2. Document this with a `# Phase 2 will migrate this to DB-backed per-user settings (D-03 deferred to user-scoped column population)` comment.
|
|
</interfaces>
|
|
</context>
|
|
|
|
<tasks>
|
|
|
|
<task type="auto" tdd="true">
|
|
<name>Task 1: Create backend/storage/ — StorageBackend ABC + factory + MinIOBackend implementation</name>
|
|
<files>backend/storage/__init__.py, backend/storage/base.py, backend/storage/minio_backend.py</files>
|
|
<behavior>
|
|
- Defining a subclass of `StorageBackend` that omits any of the five abstract methods (`put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`) and trying to instantiate it raises `TypeError`
|
|
- `get_storage_backend()` returns a `MinIOBackend` instance whose endpoint/access_key/secret_key/bucket come from `config.settings.minio_*`
|
|
- `await backend.put_object(user_id="u1", document_id="d1", file_bytes=b"abc", extension=".txt", content_type="text/plain")` returns a string of the form `u1/d1/<uuid4>.txt`
|
|
- The middle UUID segment of the returned key is a freshly generated `uuid4()` value — not the user_id and not the document_id
|
|
- The human filename is NEVER passed into the MinIO SDK call — only the constructed object key, content type, length, and `io.BytesIO(file_bytes)` are passed
|
|
- Every call to the synchronous `Minio` SDK (`put_object`, `get_object`, `remove_object`, `presigned_get_object`, `bucket_exists`, `make_bucket`) goes through `asyncio.to_thread(self._client.<method>, ...)`
|
|
- `await backend.health_check()` returns `True` when `bucket_exists` returns `True`; returns `False` when `bucket_exists` raises any exception
|
|
- `await backend.get_object(key)` returns the file bytes; under the hood it calls `self._client.get_object(...)` (which returns an `HTTPResponse`), then `response.read()`, then `response.close()` + `response.release_conn()` — all inside `asyncio.to_thread`
|
|
</behavior>
|
|
<read_first>
|
|
- backend/ai/base.py (the ABC pattern to mirror — read in full)
|
|
- backend/ai/__init__.py (the factory pattern to mirror — read in full)
|
|
- backend/ai/openai_provider.py (the concrete ABC implementation pattern — `__init__`, `_client` attribute, async methods, try/except `health_check`)
|
|
- .planning/phases/01-infrastructure-foundation/01-RESEARCH.md (Pattern 3 lines 343-419 — MinIO sync-in-async; Pattern 8 lines 607-651 — StorageBackend ABC code; Pitfall 3 — BytesIO seek(0); Code Examples lines 920-928 — config fields)
|
|
- .planning/phases/01-infrastructure-foundation/01-PATTERNS.md (backend/storage/* sections)
|
|
- backend/tests/test_storage.py (Plan 02 output — the six xfail tests defining the contract this plan must satisfy)
|
|
- backend/config.py (Plan 01 output — confirm Settings field names match what factory reads)
|
|
</read_first>
|
|
<action>
|
|
Create `backend/storage/__init__.py`. Inside it: `from storage.base import StorageBackend`, `from storage.minio_backend import MinIOBackend`, `from config import settings`. Define `def get_storage_backend() -> StorageBackend:` that returns `MinIOBackend(endpoint=settings.minio_endpoint, access_key=settings.minio_access_key, secret_key=settings.minio_secret_key, bucket=settings.minio_bucket, secure=False)`. The `secure=False` is correct for Docker internal HTTP traffic between containers — RESEARCH.md Pattern 3. Add a module docstring noting this file mirrors `backend/ai/__init__.py`.
|
|
|
|
Create `backend/storage/base.py` with the exact contents from RESEARCH.md Pattern 8 (lines 612-640): `from abc import ABC, abstractmethod`, `class StorageBackend(ABC):` with five `@abstractmethod async def` methods — `put_object(self, user_id: str, document_id: str, file_bytes: bytes, extension: str, content_type: str) -> str`, `get_object(self, object_key: str) -> bytes`, `delete_object(self, object_key: str) -> None`, `presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str`, `health_check(self) -> bool`. Each abstract method has a one-line docstring per RESEARCH.md.
|
|
|
|
Create `backend/storage/minio_backend.py` implementing `MinIOBackend(StorageBackend)`. Imports: `import asyncio`, `import io`, `import uuid`, `from datetime import timedelta`, `from minio import Minio`, `from storage.base import StorageBackend`. Constructor `def __init__(self, endpoint: str, access_key: str, secret_key: str, bucket: str, secure: bool = False)` stores `self._bucket = bucket` and `self._client = Minio(endpoint=endpoint, access_key=access_key, secret_key=secret_key, secure=secure)`.
|
|
|
|
Implement `async def put_object(self, user_id: str, document_id: str, file_bytes: bytes, extension: str, content_type: str) -> str`: construct `object_key = f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"`. Build `data = io.BytesIO(file_bytes)` (the constructor sets pointer at 0 — no `seek(0)` needed per Pitfall 3, but add an explicit `data.seek(0)` immediately afterward as a belt-and-braces safety). Call `await asyncio.to_thread(self._client.put_object, self._bucket, object_key, data, length=len(file_bytes), content_type=content_type)`. Return `object_key`. IMPORTANT: do NOT accept any `filename`/`original_name` parameter — STORE-02 requires the human filename never reaches this method.
|
|
|
|
Implement `async def get_object(self, object_key: str) -> bytes`: define a sync helper `def _fetch():` inside the method that calls `response = self._client.get_object(self._bucket, object_key)` then `try: return response.read() finally: response.close(); response.release_conn()`. Call `return await asyncio.to_thread(_fetch)`. This pattern correctly handles MinIO's HTTPResponse-returning `get_object`.
|
|
|
|
Implement `async def delete_object(self, object_key: str) -> None`: `await asyncio.to_thread(self._client.remove_object, self._bucket, object_key)`.
|
|
|
|
Implement `async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str`: `return await asyncio.to_thread(self._client.presigned_get_object, self._bucket, object_key, timedelta(minutes=expires_minutes))`.
|
|
|
|
Implement `async def health_check(self) -> bool`: wrap `await asyncio.to_thread(self._client.bucket_exists, self._bucket)` in a `try/except Exception: return False`; the truthy path returns the exact boolean from the SDK.
|
|
|
|
Finally, after writing the implementation, edit the six tests in `backend/tests/test_storage.py` to REMOVE the `@pytest.mark.xfail(strict=False, reason="implemented in plan 04")` markers (the tests should now PASS unmarked) — but keep the existing `try/except ImportError: pytest.skip(...)` blocks because they harmlessly fall through when imports succeed.
|
|
</action>
|
|
<verify>
|
|
<automated>cd /Users/nik/Documents/Progamming/document_scanner/backend && python3 -m pytest tests/test_storage.py -v 2>&1 | tail -20 && python3 -c "
|
|
from storage.base import StorageBackend
|
|
from storage import get_storage_backend
|
|
from storage.minio_backend import MinIOBackend
|
|
assert isinstance(get_storage_backend(), MinIOBackend)
|
|
abstract = {m for m in StorageBackend.__abstractmethods__}
|
|
assert abstract == {'put_object','get_object','delete_object','presigned_get_url','health_check'}, f'abstract mismatch: {abstract}'
|
|
print('storage-abc-ok')
|
|
"</automated>
|
|
</verify>
|
|
<acceptance_criteria>
|
|
- `backend/storage/base.py` exists and contains `class StorageBackend(ABC):`
|
|
- `backend/storage/base.py` declares exactly 5 `@abstractmethod` methods (verifiable via `grep -c "@abstractmethod" backend/storage/base.py >= 5`)
|
|
- The 5 abstract method names are `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check` (verified by the inline `__abstractmethods__` assertion in the Verify command)
|
|
- `backend/storage/__init__.py` contains `def get_storage_backend() -> StorageBackend:` and returns a `MinIOBackend(endpoint=settings.minio_endpoint, ...)`
|
|
- `backend/storage/minio_backend.py` contains `class MinIOBackend(StorageBackend):`
|
|
- `backend/storage/minio_backend.py` calls `asyncio.to_thread(self._client.put_object`, `asyncio.to_thread(self._client.remove_object`, `asyncio.to_thread(self._client.presigned_get_object`, `asyncio.to_thread(self._client.bucket_exists` (each verifiable via grep — each must appear at least once)
|
|
- `backend/storage/minio_backend.py` contains the literal substring `f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"` (the STORE-02 key format)
|
|
- `backend/storage/minio_backend.py` does NOT take any `filename` or `original_name` parameter on `put_object` (verifiable: `grep -E "def put_object.*filename|def put_object.*original_name" backend/storage/minio_backend.py` exits with no matches)
|
|
- `cd backend && python3 -m pytest tests/test_storage.py -v` exits 0 with all 6 tests reporting PASSED (no XFAIL, no SKIPPED, no FAILED) — verifiable via `python3 -m pytest tests/test_storage.py -v 2>&1 | grep -c "PASSED" >= 6`
|
|
- The `@pytest.mark.xfail` markers in `backend/tests/test_storage.py` are removed (verifiable: `grep -c "@pytest.mark.xfail" backend/tests/test_storage.py | grep -q "^0$"`)
|
|
</acceptance_criteria>
|
|
<done>The `storage/` module mirrors the `ai/` provider pattern exactly; `MinIOBackend` correctly wraps every sync SDK call in `asyncio.to_thread`; STORE-02 object key format is enforced in code; all six Plan 02 `test_storage.py` scaffolds pass.</done>
|
|
</task>
|
|
|
|
<task type="auto" tdd="true">
|
|
<name>Task 2: Rewrite backend/services/storage.py — async SQLAlchemy + MinIO orchestrator</name>
|
|
<files>backend/services/storage.py</files>
|
|
<behavior>
|
|
- `save_upload(session, file_bytes, original_name, mime_type) -> dict`: creates a `Document` row with a freshly generated UUID `id`, `filename=original_name`, `content_type=mime_type`, `size_bytes=len(file_bytes)`, `user_id=None` (D-03 — no auth in Phase 1), `storage_backend="minio"`, `status="pending"`; calls `await get_storage_backend().put_object(user_id=str(user_id or "null-user"), document_id=str(doc.id), file_bytes=file_bytes, extension=Path(original_name).suffix.lower(), content_type=mime_type)`; stores the returned `object_key` on the `Document.object_key` column; commits; returns a dict shape compatible with the existing API: `{"id": str(doc.id), "filename": original_name, "path": object_key, "object_key": object_key}` (the `path` key is kept for compatibility with the current `api/documents.py` line 33 — it now holds the object_key, not a filesystem path)
|
|
- `save_metadata(session, meta: dict) -> None`: idempotent update of a `Document` row from the dict shape used by the legacy `api/documents.py` (`id`, `original_name`, `extracted_text`, `topics`, `classified_at`, etc.); maps `meta["original_name"] → Document.filename`, `meta["mime_type"] → Document.content_type`, `meta["size_bytes"] → Document.size_bytes`, `meta["extracted_text"] → Document.extracted_text`; `topics` list is reconciled into the `DocumentTopic` association table (delete existing rows for the document, insert one per topic name — auto-creating topics that don't yet exist via the existing `create_topic` helper); `classified_at → Document.updated_at`
|
|
- `get_metadata(session, doc_id: str) -> dict | None`: returns `None` for not-found; otherwise returns a dict matching the response shape `api/documents.py` produces today: `{"id", "original_name", "filename", "mime_type", "size_bytes", "extracted_text", "topics" (list of topic name strings), "created_at", "classified_at"}`
|
|
- `list_metadata(session, topic: str | None = None) -> list[dict]`: returns the same dict shape per row; if `topic` is provided, filters to documents joined to a `Topic` row whose `name == topic`
|
|
- `delete_document(session, doc_id: str) -> bool`: returns `False` if the document does not exist; otherwise calls `await get_storage_backend().delete_object(doc.object_key)`, deletes the `Document` row (cascade deletes `DocumentTopic` rows via the ORM relationship `ondelete="CASCADE"`), commits, returns `True`
|
|
- `update_document_topics(session, doc_id, topics) -> dict | None`: replaces the association rows and returns the refreshed metadata dict (or `None` for not-found)
|
|
- `remove_topic_from_all_documents(session, topic_name) -> int`: deletes all `DocumentTopic` rows whose `topic_id` matches the topic with that name; returns the count
|
|
- Topic functions (`load_topics`, `create_topic`, `update_topic`, `delete_topic`, `get_topic`, `topic_doc_counts`) are async and query the `Topic` and `DocumentTopic` tables; `create_topic` auto-deduplicates by lowercased name (preserving the existing behavior); response shape matches the existing JSON: `{"id", "name", "description", "color"}` per topic; topic `id` is the UUID stringified
|
|
- Settings functions (`load_settings`, `save_settings`, `mask_api_key`, `settings_masked`) continue to read/write the flat `SETTINGS_FILE` JSON in Phase 1 — these will move to the `users` table in Phase 2; document this with a comment
|
|
- NO `filelock` import remains in this module
|
|
- NO `json.loads` / `json.dumps` / `Path.read_text` / `Path.write_text` calls touch document or topic data (only the still-flat `SETTINGS_FILE` may keep using JSON)
|
|
</behavior>
|
|
<read_first>
|
|
- backend/services/storage.py (current 188-line file — read in full; understand every function signature and return shape so the rewrite preserves the contract)
|
|
- backend/db/models.py (Plan 03 output — exact column names and relationships)
|
|
- backend/api/documents.py (read in full — note exactly which `storage.*` calls are made and what return-shape fields are consumed)
|
|
- backend/api/topics.py (read in full — same scrutiny for topic-related calls)
|
|
- backend/services/classifier.py (read in full — uses `storage.get_metadata`, `storage.load_settings`, `storage.load_topics`, `storage.create_topic`, `storage.update_document_topics`)
|
|
- .planning/phases/01-infrastructure-foundation/01-PATTERNS.md (backend/services/storage.py section — explains the contract preservation strategy)
|
|
- .planning/phases/01-infrastructure-foundation/01-RESEARCH.md (Pitfall 1 — expire_on_commit=False; Pattern 1 — session lifecycle)
|
|
- .planning/phases/01-infrastructure-foundation/01-CONTEXT.md (D-04 data dir deleted, D-05 storage layer switched, D-06 object key schema)
|
|
</read_first>
|
|
<action>
|
|
Replace `backend/services/storage.py` entirely. New imports: `import uuid`, `import json`, `from datetime import datetime, timezone`, `from pathlib import Path`, `from sqlalchemy import select, func as sql_func, delete`, `from sqlalchemy.ext.asyncio import AsyncSession`, `from sqlalchemy.orm import selectinload`, `from db.models import Document, Topic, DocumentTopic`, `from storage import get_storage_backend`, `from config import settings as cfg_settings, DEFAULT_SETTINGS, SETTINGS_FILE`. Do NOT import `filelock` — that dependency is removed.
|
|
|
|
Add a module-level `_storage = None` and a `def _backend(): global _storage; _storage = _storage or get_storage_backend(); return _storage` helper so the MinIO backend is lazily instantiated once per process (matching the singleton-like behavior of the previous module-level filelock objects).
|
|
|
|
Implement every function below as `async def` (except `mask_api_key`, `settings_masked` which stay sync). The session is the first positional parameter for every DB-touching function.
|
|
|
|
1. `async def save_upload(session: AsyncSession, file_bytes: bytes, original_name: str, mime_type: str) -> dict`: generate `doc_id = uuid.uuid4()`. Extract `suffix = Path(original_name).suffix.lower()`. Build `doc = Document(id=doc_id, user_id=None, filename=original_name, content_type=mime_type, size_bytes=len(file_bytes), storage_backend="minio", status="pending", object_key="")`. `session.add(doc)`. `await session.flush()` (to materialize `doc.id` without committing). Call `object_key = await _backend().put_object(user_id="null-user", document_id=str(doc_id), file_bytes=file_bytes, extension=suffix, content_type=mime_type)` — the literal sentinel `"null-user"` is used because there is no user in Phase 1 (D-03); Phase 2 will replace this with `str(current_user.id)`. Set `doc.object_key = object_key`. `await session.commit()`. Return `{"id": str(doc_id), "filename": original_name, "path": object_key, "object_key": object_key, "user_id": None}`.
|
|
|
|
2. `async def save_metadata(session: AsyncSession, meta: dict) -> None`: look up `doc = await session.get(Document, uuid.UUID(meta["id"]))`. If not found, return. Set `doc.extracted_text = meta.get("extracted_text", "")`. If `meta.get("topics")` is a non-empty list of strings, call `await update_document_topics(session, meta["id"], meta["topics"])` (which handles association table reconciliation). Set `doc.status = "classified" if meta.get("classified_at") else "pending"`. `await session.commit()`.
|
|
|
|
3. `async def get_metadata(session: AsyncSession, doc_id: str) -> dict | None`: try `uid = uuid.UUID(doc_id)` inside try/except `ValueError: return None` (keep the legacy string-doc-id tolerance from `api/documents.py`). `doc = await session.get(Document, uid)`. If `None`, return `None`. Load topics: `topics_q = await session.execute(select(Topic.name).join(DocumentTopic, DocumentTopic.topic_id == Topic.id).where(DocumentTopic.document_id == uid))`; `topic_names = [r[0] for r in topics_q]`. Return `{"id": str(doc.id), "original_name": doc.filename, "filename": doc.filename, "mime_type": doc.content_type, "size_bytes": doc.size_bytes, "extracted_text": doc.extracted_text or "", "topics": topic_names, "created_at": doc.created_at.isoformat() if doc.created_at else None, "classified_at": doc.updated_at.isoformat() if doc.status == "classified" else None}`.
|
|
|
|
4. `async def list_metadata(session: AsyncSession, topic: str | None = None) -> list[dict]`: base query `stmt = select(Document).order_by(Document.created_at.desc())`. If `topic` is provided, join `DocumentTopic` and `Topic` and filter `Topic.name == topic`. Execute, then for each `Document` call the same dict-build code as `get_metadata` (factor it out into a private `def _doc_to_dict(doc, topic_names)` helper).
|
|
|
|
5. `async def delete_document(session: AsyncSession, doc_id: str) -> bool`: lookup as above; if not found, return `False`. Call `await _backend().delete_object(doc.object_key)` inside try/except (log to stderr on failure but still proceed with DB delete — the bytes may already be gone). `await session.delete(doc)`. `await session.commit()`. Return `True`.
|
|
|
|
6. `async def update_document_topics(session: AsyncSession, doc_id: str, topics: list[str]) -> dict | None`: lookup the doc; return `None` if missing. Delete all existing `DocumentTopic` rows for this document via `await session.execute(delete(DocumentTopic).where(DocumentTopic.document_id == uid))`. For each name in `topics` (deduped), call `topic_dict = await create_topic(session, name)` (auto-create if missing) and insert a `DocumentTopic(document_id=uid, topic_id=uuid.UUID(topic_dict["id"]))`. Set `doc.status = "classified"`. `await session.commit()`. Return `await get_metadata(session, doc_id)`.
|
|
|
|
7. `async def remove_topic_from_all_documents(session: AsyncSession, topic_name: str) -> int`: find the topic by name; if missing, return 0. Run `result = await session.execute(delete(DocumentTopic).where(DocumentTopic.topic_id == topic.id))`. `await session.commit()`. Return `result.rowcount`.
|
|
|
|
8. `async def load_topics(session: AsyncSession) -> list[dict]`: `q = await session.execute(select(Topic).order_by(Topic.name))`. Return `[{"id": str(t.id), "name": t.name, "description": t.description, "color": t.color} for t in q.scalars()]`.
|
|
|
|
9. `async def save_topics(session: AsyncSession, topics: list[dict]) -> None`: idempotent bulk replace — delete all `Topic` rows, then insert the provided list. (Phase 1 use: not directly called by any current endpoint; preserved for legacy compatibility — add a `# legacy: not used by current endpoints` comment.)
|
|
|
|
10. `async def get_topic(session: AsyncSession, topic_id: str) -> dict | None`: `t = await session.get(Topic, uuid.UUID(topic_id))`; return `None` or `{"id": str(t.id), ...}`.
|
|
|
|
11. `async def create_topic(session: AsyncSession, name: str, description: str = "", color: str = "#6366f1") -> dict`: case-insensitive deduplication — `q = await session.execute(select(Topic).where(sql_func.lower(Topic.name) == name.lower()))`. If a row exists, return its dict shape. Otherwise insert a new `Topic(name=name, description=description, color=color)`, commit, return dict.
|
|
|
|
12. `async def update_topic(session: AsyncSession, topic_id: str, name=None, description=None, color=None) -> dict | None`: lookup; if missing, `None`. Update only non-None fields. Commit. Return dict.
|
|
|
|
13. `async def delete_topic(session: AsyncSession, topic_id: str) -> str | None`: lookup; if missing, `None`. Save the name. Delete the Topic (cascade removes DocumentTopic rows via `ondelete="CASCADE"`). Commit. Return the saved name.
|
|
|
|
14. `async def topic_doc_counts(session: AsyncSession) -> dict[str, int]`: `q = await session.execute(select(Topic.name, sql_func.count(DocumentTopic.document_id)).join(DocumentTopic, DocumentTopic.topic_id == Topic.id, isouter=True).group_by(Topic.name))`. Return `{name: count for name, count in q}`.
|
|
|
|
15. `def load_settings() -> dict`: keep sync; read `SETTINGS_FILE` (still flat JSON) — if file missing, return a deep copy of `DEFAULT_SETTINGS`. Add comment `# Phase 2 will move per-user settings to users.ai_provider / users.ai_model`.
|
|
|
|
16. `def save_settings(settings: dict) -> None`: keep sync; write `SETTINGS_FILE`. No filelock — Phase 1 settings file is single-writer.
|
|
|
|
17. `def mask_api_key(key: str) -> str` and `def settings_masked(settings: dict) -> dict`: keep verbatim from current file (pure functions).
|
|
|
|
At the bottom of the file, expose every function via explicit `__all__ = ["save_upload", "save_metadata", "get_metadata", "list_metadata", "delete_document", "update_document_topics", "remove_topic_from_all_documents", "load_topics", "save_topics", "get_topic", "create_topic", "update_topic", "delete_topic", "topic_doc_counts", "load_settings", "save_settings", "mask_api_key", "settings_masked"]` so importers see the public surface.
|
|
</action>
|
|
<verify>
|
|
<automated>cd /Users/nik/Documents/Progamming/document_scanner/backend && python3 -c "
|
|
import inspect
|
|
import services.storage as st
|
|
async_funcs = ['save_upload','save_metadata','get_metadata','list_metadata','delete_document','update_document_topics','remove_topic_from_all_documents','load_topics','save_topics','get_topic','create_topic','update_topic','delete_topic','topic_doc_counts']
|
|
sync_funcs = ['load_settings','save_settings','mask_api_key','settings_masked']
|
|
for f in async_funcs:
|
|
assert inspect.iscoroutinefunction(getattr(st, f)), f'{f} should be async'
|
|
for f in sync_funcs:
|
|
fn = getattr(st, f)
|
|
assert not inspect.iscoroutinefunction(fn), f'{f} should be sync'
|
|
assert 'filelock' not in open(st.__file__).read(), 'filelock import still present'
|
|
assert 'FileLock' not in open(st.__file__).read(), 'FileLock import still present'
|
|
print('services-storage-rewrite-ok')
|
|
"</automated>
|
|
</verify>
|
|
<acceptance_criteria>
|
|
- `backend/services/storage.py` no longer imports `filelock` or `FileLock` (verifiable: `grep -cE "from filelock|import filelock|FileLock" backend/services/storage.py | grep -q "^0$"`)
|
|
- `backend/services/storage.py` imports `from sqlalchemy.ext.asyncio import AsyncSession` and `from db.models import Document, Topic, DocumentTopic`
|
|
- `backend/services/storage.py` imports `from storage import get_storage_backend`
|
|
- All 14 DB-touching functions are `async def` (verified by the inline `inspect.iscoroutinefunction` assertions in the Verify command)
|
|
- All 4 pure functions (`load_settings`, `save_settings`, `mask_api_key`, `settings_masked`) remain sync
|
|
- `save_upload` constructs an object_key via the MinIO backend (verifiable: `grep -c "await.*put_object" backend/services/storage.py >= 1`)
|
|
- `save_upload` passes `user_id="null-user"` (D-03 sentinel) — verifiable via `grep -c 'user_id="null-user"\|user_id=.null-user.' backend/services/storage.py >= 1`
|
|
- The file declares `__all__ = [...]` listing all 18 public functions (verifiable: `grep -c "__all__" backend/services/storage.py >= 1`)
|
|
- The file has at least 120 lines (the `must_haves.artifacts.min_lines` target — verifiable: `wc -l backend/services/storage.py | awk '{exit ($1 >= 120) ? 0 : 1}'`)
|
|
- No `Path(...).read_text()` or `Path(...).write_text()` call references `METADATA_DIR`, `UPLOADS_DIR`, `TOPICS_FILE` (verifiable: `grep -cE "METADATA_DIR|UPLOADS_DIR|TOPICS_FILE" backend/services/storage.py | grep -q "^0$"`)
|
|
- `SETTINGS_FILE` references remain (for `load_settings`/`save_settings` — verifiable: `grep -c "SETTINGS_FILE" backend/services/storage.py >= 1`)
|
|
- `cd backend && python3 -c "import services.storage; print('import-ok')"` exits 0 with `import-ok`
|
|
- All previous Plan 02 `test_storage.py` tests still PASSED (the rewrite must not have removed the `MinIOBackend` exports or the factory) — `cd backend && python3 -m pytest tests/test_storage.py -v` exits 0 with 6 PASSED
|
|
</acceptance_criteria>
|
|
<done>`backend/services/storage.py` is fully async, all DB access goes through SQLAlchemy ORM, all object storage goes through the `MinIOBackend` via `get_storage_backend()`, no flat-file or filelock code remains for documents or topics. The public function names are preserved so Plan 05 can wire callers with `def → async def + await` plus a `session` parameter.</done>
|
|
</task>
|
|
|
|
</tasks>
|
|
|
|
<threat_model>
|
|
## Trust Boundaries
|
|
|
|
| Boundary | Description |
|
|
|----------|-------------|
|
|
| `services/storage.py` → `storage/minio_backend.py` | Internal Python call; bytes never cross a network boundary inside this module |
|
|
| `storage/minio_backend.py` → MinIO server | App connects with `MINIO_ACCESS_KEY` (app-level credentials, not root) over Docker internal HTTP |
|
|
| `services/storage.py` → PostgreSQL via `AsyncSession` | All queries parameterized via SQLAlchemy ORM (SEC-03) |
|
|
|
|
## STRIDE Threat Register
|
|
|
|
| Threat ID | Category | Component | Disposition | Mitigation Plan |
|
|
|-----------|----------|-----------|-------------|-----------------|
|
|
| T-01-04-01 | Tampering | Object key prediction / path traversal | mitigate | `MinIOBackend.put_object` constructs keys server-side as `{user_id}/{document_id}/{uuid4()}{ext}` (STORE-02, D-06); no caller can inject a key. The `extension` is derived from `Path(original_name).suffix.lower()` — the only piece of user input — and is concatenated as a suffix to a freshly generated UUID, so a malicious extension cannot escape the prefix (the leading `{user_id}/{document_id}/{uuid4}` portion is always server-controlled). Wave 0 tests `test_object_key_schema` + `test_filename_not_in_object_key` enforce this on every CI run. |
|
|
| T-01-04-02 | Information Disclosure | Human filename leaking into the object store namespace | mitigate | `MinIOBackend.put_object` does NOT accept a `filename` parameter (only `extension`). The human filename is stored only in the `documents.filename` DB column (D-06). Test `test_filename_not_in_object_key` asserts this on every run. |
|
|
| T-01-04-03 | Information Disclosure | Extension as a covert channel (e.g., `.html` for XSS via download) | accept | Phase 1 has no download flow exposed to the browser (no presigned URL endpoint, no proxied download endpoint). Phase 4 (DOC-02) will introduce PDF-only proxied preview and content-type validation. Documented in SKELETON.md. |
|
|
| T-01-04-04 | Tampering | SQL injection via ORM string concatenation | mitigate | All queries use `select(...)`, `delete(...)`, `session.execute(...)` with parameterized values — never f-strings or `.format()` for SQL. The only string interpolation is the MinIO object key, which is not SQL. SEC-03 / CLAUDE.md. |
|
|
| T-01-04-05 | Denial of Service | Unbounded list_metadata returning all documents | accept | Phase 1 has only the single test user's documents (≤ 100 MB of uploads from D-04 reset state). Phase 3 (STORE-03) introduces pagination + quota enforcement. The existing pagination in `api/documents.py` (`page`, `per_page`) remains in effect — Plan 05 will wire it. |
|
|
| T-01-04-SC | Tampering | npm/pip/cargo installs | N/A | No new installs in this plan; uses minio/sqlalchemy from Plan 01. RESEARCH.md Package Legitimacy Audit covers both (slopcheck OK). |
|
|
</threat_model>
|
|
|
|
<verification>
|
|
- `cd backend && python3 -m pytest tests/test_storage.py -v` exits 0 with all 6 tests PASSED (no XFAIL).
|
|
- `cd backend && python3 -c "from services import storage; import inspect; assert inspect.iscoroutinefunction(storage.save_upload)"` exits 0.
|
|
- `grep -c "filelock\|FileLock" backend/services/storage.py` equals 0.
|
|
- The full test suite `cd backend && python3 -m pytest tests/ -v` still exits 0 — the legacy sync tests (`test_documents.py` un-suffixed names) still pass against the OLD flat-file behavior because Plan 05 has not yet swapped `api/documents.py` callers; this is intentional and documents the cutover boundary.
|
|
</verification>
|
|
|
|
<success_criteria>
|
|
- `backend/storage/base.py`, `backend/storage/__init__.py`, `backend/storage/minio_backend.py` exist; the ABC + factory pattern mirrors `backend/ai/` exactly.
|
|
- `backend/services/storage.py` is rewritten with async signatures, ORM access, and the MinIO backend integration; `filelock` is gone from this module.
|
|
- All six Plan 02 `tests/test_storage.py` tests pass (xfail markers removed).
|
|
- The legacy `tests/test_documents.py` sync tests continue to pass — the cutover happens in Plan 05; this plan is additive on the storage layer side.
|
|
</success_criteria>
|
|
|
|
<output>
|
|
Create `.planning/phases/01-infrastructure-foundation/01-04-SUMMARY.md` when done. Include: full function-by-function summary of `services/storage.py` changes (old signature → new signature), the MinIO key format produced for a sample upload, and any helpers introduced (e.g., `_doc_to_dict`, `_backend()` singleton).
|
|
</output>
|