# Phase 1: Infrastructure Foundation - Research

**Researched:** 2026-05-21
**Domain:** PostgreSQL + MinIO + Redis + Celery wired into FastAPI via Docker Compose; Alembic async migrations; storage service rewrite
**Confidence:** HIGH

---

## User Constraints (from CONTEXT.md)

### Locked Decisions

**Schema Scope**

- D-01: Phase 1 initial Alembic migration creates the full v1 skeleton, with all tables: `users`, `refresh_tokens`, `quotas`, `documents`, `topics`, `folders`, `shares`, `audit_log`, `cloud_connections`. Subsequent phases add data and constraints, not new tables.
- D-02: `groups` table stub included in Phase 1 migration (v2 feature; empty table, correct columns and FKs).
- D-03: `documents.user_id` is nullable in Phase 1 (no auth system yet). Phase 2 migration adds the NOT NULL constraint after the user/auth system is live.
- D-04: Existing `data/` directory contents (flat-file JSON metadata + uploaded files) are deleted in Phase 1. Test data only; no migration script needed.

**App Wiring**

- D-05: Phase 1 switches the storage service layer to PostgreSQL + MinIO. `backend/services/storage.py` is rewritten to use async SQLAlchemy + MinIO SDK. The app does not continue using the filesystem after Phase 1.
- D-06: Single MinIO bucket named `docuvault`. Object keys follow `{user_id}/{document_id}/{uuid4()}{ext}` (STORE-02). Human-readable filenames are stored in the `documents.filename` DB column only, never in the MinIO key.
- D-07: `backend/main.py` `/health` endpoint extended to check PostgreSQL + MinIO connectivity (not just `{"status": "ok"}`). Health checks gate `docker compose up` readiness.

**Background Worker**

- D-08: Background task queue: Celery + Redis (STORE-08). FastAPI `BackgroundTasks` replaced.
- D-09: Redis service added to `docker-compose.yml` in Phase 1. Redis doubles as the rate-limiting store for Phase 2 auth endpoints; no second Redis is needed later.
- D-10: A `celery-worker` service is added to `docker-compose.yml`.
Celery broker and result backend both point to the same Redis instance via `REDIS_URL`.

**Env / Secrets Strategy**

- D-11: `.env` gitignored + `.env.example` committed. `docker-compose.yml` reads vars via `${VAR_NAME}`. `.env.example` has safe placeholder values and comments explaining each variable.
- D-12: Production secrets stored outside the project directory at `/etc/docuvault/env` (`chmod 600`, owned by the service user, not root). `docker-compose.yml` references it via `env_file:`. Documented in deployment notes.
- D-13: Two PostgreSQL DSNs: `DATABASE_URL` (restricted app user `docuvault_app`, SELECT/INSERT/UPDATE/DELETE only; no DDL) and `DATABASE_MIGRATE_URL` (migration user `docuvault_migrate`, DDL privileges; used only by Alembic).
- D-14: PostgreSQL init script in `docker/postgres/initdb.d/` provisions both users on first container start. The app never connects as the PostgreSQL superuser.
- D-15: MinIO vars: `MINIO_ENDPOINT`, `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD` (init only), `MINIO_BUCKET` (value: `docuvault`), `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY` (separate app-level access key pair with minimal bucket permissions).
- D-16: Additional vars in Phase 1 `.env.example`: `REDIS_URL`, `SECRET_KEY` (documented now for Phase 2 JWT + HKDF use; app does not read it in Phase 1).

### Claude's Discretion

None. The user made explicit choices for all areas.

### Deferred Ideas (OUT OF SCOPE)

None. Discussion stayed within phase scope.

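
As a closing illustration for the env var decisions (D-13, D-15, D-16), the sketch below shows one way a startup check could report which runtime variables are missing. It is a stdlib-only stand-in, not the project's `pydantic-settings` configuration; the helper name `missing_vars` and the choice of which variables count as required at runtime are assumptions made for illustration.

```python
import os
from collections.abc import Mapping

# Names taken from D-13, D-15 and D-16. SECRET_KEY and the MINIO_ROOT_*
# pair are left out on the assumption that the Phase 1 app does not read them.
REQUIRED_VARS = (
    "DATABASE_URL",
    "DATABASE_MIGRATE_URL",
    "MINIO_ENDPOINT",
    "MINIO_BUCKET",
    "MINIO_ACCESS_KEY",
    "MINIO_SECRET_KEY",
    "REDIS_URL",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in tests.
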

---

## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| STORE-01 | Platform storage layer migrated from flat-file JSON + local filesystem to PostgreSQL (metadata) + MinIO (objects) | SQLAlchemy 2.0 async ORM + MinIO SDK patterns documented; service rewrite approach confirmed |
| STORE-02 | Each user's MinIO objects use `{user_id}/{document_id}/{uuid4()}{ext}` keys; human-readable filenames stored in DB only | MinIO `put_object()` API confirmed; key schema enforced in model/service layer |
| STORE-07 | Backend is stateless: no per-instance file locks; multiple instances can run behind a load balancer | PostgreSQL atomic UPDATE + Celery + Redis replaces filelock pattern; verified |

---

## Summary

Phase 1 replaces the entire flat-file persistence layer (JSON metadata + local filesystem uploads) with PostgreSQL (via SQLAlchemy 2.0 async ORM) + MinIO (via the official Python SDK) wired into Docker Compose. Redis and a Celery worker are added alongside as the background task queue that replaces FastAPI `BackgroundTasks`, delivering the statelessness required by STORE-07. All infrastructure services are health-checked and ordered via `depends_on` conditions so `docker compose up` can be treated as the single operational command. Alembic manages the schema using the async migration template with a two-DSN strategy (restricted app user + DDL migration user).

The walking skeleton requirement is satisfied by: the full v1 schema applied via Alembic, one real document upload persisted to PostgreSQL and MinIO through the rewritten storage service, and the `/health` endpoint returning live connectivity checks for PostgreSQL and MinIO (D-07).

The existing single-user document upload → text extraction → AI classification workflow continues to work end-to-end after Phase 1. The Vue frontend requires no changes. All API routes and response shapes are preserved.

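
The object key scheme from STORE-02 can be sketched in a few lines. This is a minimal illustration, not the project's service code: the function name `build_object_key` is hypothetical, and lowercasing the extension is an assumption the source does not state.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Return a key shaped like {user_id}/{document_id}/{uuid4()}{ext}."""
    # Only the extension of the uploaded filename survives. The name itself
    # belongs in the documents.filename column, never in the key.
    ext = PurePosixPath(filename).suffix.lower()  # lowercasing is an assumption
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


if __name__ == "__main__":
    print(build_object_key("user-1", "doc-1", "Tax Return 2025.PDF"))
```

Because only the suffix of the final path component is used, a filename such as `../../etc/passwd` contributes nothing to the key.
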

**Primary recommendation:** Wire infrastructure with Docker Compose health checks first; apply Alembic migration second; rewrite `services/storage.py` third; replace `BackgroundTasks` with Celery tasks last. This ordering allows each layer to be verified before the next is built.

---

## Architectural Responsibility Map

| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| Document metadata persistence | Database / Storage (PostgreSQL) | API / Backend | All metadata is authored and read server-side; no client involvement |
| Binary file storage | Database / Storage (MinIO) | API / Backend | Object store owns bytes; backend generates keys and proxies operations |
| Background text extraction + classification | Background Worker (Celery) | API / Backend | CPU-intensive, deferred; must not block HTTP event loop |
| Health checking | API / Backend | Docker Compose | FastAPI `/health` probes PostgreSQL + MinIO; Compose waits on it |
| Schema migrations | Database / Storage (Alembic + PostgreSQL) | - | DDL-only responsibility; executed before app starts |
| Object key namespacing | API / Backend (service layer) | - | Key construction is a code concern, not a storage concern |
| Service ordering / startup sequencing | Docker Compose | - | `depends_on: condition: service_healthy` enforces boot order |
| Connection pooling | API / Backend (SQLAlchemy pool) | Database / Storage | App holds pool; PostgreSQL is the pooled resource |
| Task queue / broker | Background Worker (Redis / Celery) | API / Backend | Broker is Redis; workers are separate Docker Compose services |

---

## Standard Stack

### Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `sqlalchemy[asyncio]` | `>=2.0.49` | ORM + async engine + connection pool | Industry standard for Python async PostgreSQL; `create_async_engine` + `async_sessionmaker` pattern is the canonical FastAPI integration |
| `psycopg[binary]` | `>=3.3.4` | PostgreSQL async driver | psycopg v3 (`psycopg`) is SQLAlchemy 2.0's preferred async dialect; `[binary]` provides pre-built wheels with no system dependency on libpq headers |
| `alembic` | `>=1.18.4` | Database migrations | The migration tool maintained by the SQLAlchemy project; provides async template (`alembic init -t async`) |
| `minio` | `>=7.2.20` | MinIO / S3 object storage SDK | Official MinIO Python SDK; stable API for `put_object`, `get_object`, `bucket_exists`, `presigned_get_object` |
| `celery[redis]` | `>=5.6.3` | Background task queue + Redis transport | Battle-tested distributed task queue; `[redis]` extra installs `redis` client; replaces per-instance `BackgroundTasks` |
| `redis` | `>=7.4.0` | Redis Python client (Celery dependency + Phase 2 rate limiting) | Official Redis client; installed transitively by `celery[redis]` but worth pinning for Phase 2 rate limiting use |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `pydantic-settings` | `>=2.2` | Env var configuration (already in project) | Extended with new DATABASE_URL, MINIO_*, REDIS_URL vars |
| `anyio` | `>=4.13.0` | Async testing utilities | Required by `httpx` for async test transport in pytest |
| `httpx` | `>=0.28.1` | Async HTTP client for integration tests | Needed to replace `TestClient` (sync) with `AsyncClient` for async route testing |
| `pytest-asyncio` | `>=1.3.0` | Async test runner integration | Already in project as `>=0.23`; upgrade to `>=1.3.0` for `asyncio_mode = auto` support in new async tests |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `psycopg[binary]` | `asyncpg` | `asyncpg` is faster in benchmarks but requires a separate sync driver (`psycopg2`) for Alembic. `psycopg` v3 works for both sync (Alembic) and async (FastAPI) with the same URL, so no driver switching is needed |
| `celery[redis]` | `pgqueuer` / `pg_boss` | pgqueuer uses PostgreSQL as the queue (no Redis required). However, the user explicitly selected Celery + Redis. Redis is also needed in Phase 2 for rate limiting, so Redis is justified regardless |
| `minio` Python SDK (sync, wrapped in `asyncio.to_thread`) | `aiobotocore` | MinIO SDK is the official client with full API coverage including MinIO-specific features. `aiobotocore` is AWS-oriented and less tested with MinIO-specific APIs. `to_thread()` wrapping is the correct async pattern for the sync SDK |

**Installation (backend/requirements.txt additions):**

```
sqlalchemy[asyncio]>=2.0.49
psycopg[binary]>=3.3.4
alembic>=1.18.4
minio>=7.2.20
celery[redis]>=5.6.3
redis>=7.4.0
httpx>=0.28.1
pytest-asyncio>=1.3.0
```

Note: `psycopg[binary]` is specified with bracket extras in requirements.txt. The binary extra installs a self-contained wheel, so no system `libpq-dev` package is required in the Docker image, which simplifies the Dockerfile.

---

## Package Legitimacy Audit

All packages verified on PyPI registry via `pip3 index versions` and `slopcheck install` (v0.6.1, run 2026-05-21).

| Package | Registry | Age | Downloads | Source Repo | slopcheck | Disposition |
|---------|----------|-----|-----------|-------------|-----------|-------------|
| `sqlalchemy` | PyPI | ~20 yrs | Very high (millions/wk) | github.com/sqlalchemy/sqlalchemy | OK | Approved |
| `psycopg` | PyPI | ~4 yrs (v3) | High | github.com/psycopg/psycopg | OK | Approved |
| `alembic` | PyPI | ~12 yrs | Very high | github.com/sqlalchemy/alembic | OK | Approved |
| `minio` | PyPI | ~8 yrs | High | github.com/minio/minio-py | OK | Approved |
| `celery` | PyPI | ~15 yrs | Very high (millions/wk) | github.com/celery/celery | OK | Approved |
| `redis` | PyPI | ~12 yrs | Very high | github.com/redis/redis-py | OK | Approved |

**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none

Note: `psycopg[binary]` is specified with extras syntax in requirements.txt; the installable wheel is `psycopg-binary` on PyPI, which also passed registry verification (version 3.3.4 confirmed).
[VERIFIED: PyPI registry + slopcheck OK]

---

## Architecture Patterns

### System Architecture Diagram

```
Browser (Vue 3 SPA, unchanged in Phase 1)
    │  HTTP/JSON + multipart (same API contract)
    ▼
FastAPI (port 8000): lifespan creates async engine, disposes on shutdown
    │
    ├── api/documents.py ─── calls ──► services/storage.py (REWRITTEN)
    │                                       │
    │                                       ├─► db/session.py (AsyncSession)
    │                                       │         │
    │                                       │         ▼
    │                                       │   PostgreSQL (port 5432)
    │                                       │   [docuvault_app user, restricted]
    │                                       │
    │                                       └─► storage/minio_backend.py
    │                                                 │
    │                                                 ▼
    │                                           MinIO (port 9000)
    │                                           [bucket: docuvault]
    │                                           [app-level access key]
    │
    ├── /health ─── probes ──► PostgreSQL + MinIO connectivity
    │
    └── celery_app.py ─── enqueues tasks ──► Redis (port 6379)
                                                  │
                                           Celery Worker (separate container)
                                           ├── task: extract_and_classify()
                                           │     ├─► services/extractor.py
                                           │     └─► services/classifier.py
                                           └── consumes from Redis queue

Alembic (run once at deploy time, not part of app startup)
    │  uses DATABASE_MIGRATE_URL (docuvault_migrate user, DDL privileges)
    └─► PostgreSQL: applies full v1 schema
```

### Recommended Project Structure

```
backend/
├── main.py                  # FastAPI app; extend lifespan for engine/dispose
├── config.py                # pydantic-settings: extend with new env vars
├── celery_app.py            # Celery app instance (broker from REDIS_URL)
├── db/
│   ├── __init__.py
│   ├── session.py           # async engine + async_sessionmaker
│   └── models.py            # all SQLAlchemy ORM models (full v1 schema)
├── deps/
│   └── db.py                # get_db(): yields AsyncSession
├── services/
│   ├── storage.py           # REPLACED: async SQLAlchemy + MinIO SDK
│   ├── extractor.py         # unchanged
│   └── classifier.py        # update to accept session; dispatch via Celery
├── storage/                 # NEW: StorageBackend ABC + MinIO implementation
│   ├── __init__.py          # get_storage_backend() factory
│   ├── base.py              # StorageBackend ABC (mirrors ai/base.py)
│   └── minio_backend.py     # MinIO implementation
├── tasks/
│   └── document_tasks.py    # Celery task definitions (extract_and_classify)
├── migrations/              # Alembic migration directory
│   ├── env.py               # async env.py with two-DSN strategy
│   ├── script.py.mako
│   └── versions/
│       └── 0001_initial_schema.py
├── alembic.ini              # sqlalchemy.url = DATABASE_MIGRATE_URL
├── api/
│   ├── documents.py         # update to use async storage service
│   ├── topics.py            # unchanged (topics still in DB after migration)
│   └── settings.py          # unchanged
└── tests/
    ├── conftest.py          # UPDATE: add async engine + session fixtures
    ├── test_health.py       # UPDATE: test PostgreSQL + MinIO health probes
    ├── test_documents.py    # UPDATE: adapt for async storage layer
    └── test_storage.py      # NEW: unit tests for MinIO object key schema
```

### Pattern 1: SQLAlchemy 2.0 Async Engine + Session Factory (FastAPI Lifespan)

**What:** Create the engine once at import time and share it application-wide; dispose of it in the FastAPI lifespan on shutdown. The session factory (`async_sessionmaker`) yields per-request sessions via a FastAPI dependency.

**When to use:** Any database access in FastAPI route handlers or services.

**Example:**

```python
# db/session.py
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from config import settings

engine = create_async_engine(
    settings.database_url,  # postgresql+psycopg://docuvault_app:...@postgres/docuvault
    pool_pre_ping=True,     # detect stale connections before use
    echo=False,
)

AsyncSessionLocal = async_sessionmaker(
    engine,
    class_=AsyncSession,
    expire_on_commit=False,  # prevent lazy-load errors after commit
)

# deps/db.py
from db.session import AsyncSessionLocal

async def get_db():
    # The context manager closes the session when the request finishes,
    # including when the handler raises.
    async with AsyncSessionLocal() as session:
        yield session

# main.py: lifespan
from contextlib import asynccontextmanager
from fastapi import FastAPI
from db.session import engine

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: the engine opens pooled connections lazily on first use
    yield
    # Shutdown: close all pooled connections
    await engine.dispose()

app = FastAPI(lifespan=lifespan)
```

**Source:** [CITED: docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html]

**Key detail (URL format for psycopg v3):**

```
postgresql+psycopg://user:password@host:port/dbname
```

The same `postgresql+psycopg://` prefix works for both `create_engine()` (Alembic) and `create_async_engine()` (FastAPI). SQLAlchemy selects the sync or async dialect variant automatically. [CITED: docs.sqlalchemy.org/en/20/dialects/postgresql.html]

**Key detail (`expire_on_commit=False`):** After `session.commit()`, SQLAlchemy marks all objects as expired, so the next attribute access would trigger another SELECT. In an async context that implicit lazy load runs outside an awaited call, so SQLAlchemy raises `MissingGreenlet`. Setting `expire_on_commit=False` prevents this. [CITED: docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html]

---

### Pattern 2: Alembic Async Configuration with Two DSNs

**What:** Alembic's async template (`alembic init -t async`) generates `env.py` that uses `async_engine_from_config` and `asyncio.run()`. The `DATABASE_MIGRATE_URL` DSN (DDL privileges) is used only by Alembic; the app uses `DATABASE_URL` (restricted). This separates migration risk from runtime risk.

**When to use:** Every `alembic upgrade head` call. Never used by FastAPI directly.

**Example:**

```python
# migrations/env.py (key section: async online migrations)
import asyncio
import os

from alembic import context
from sqlalchemy import pool
from sqlalchemy.ext.asyncio import async_engine_from_config

from db.models import Base  # import all models so metadata is populated

config = context.config
# alembic.ini cannot read environment variables, so inject the migration DSN here.
# "%" is doubled because the value passes through ConfigParser interpolation.
config.set_main_option(
    "sqlalchemy.url", os.environ["DATABASE_MIGRATE_URL"].replace("%", "%%")
)

target_metadata = Base.metadata

def do_run_migrations(connection):
    context.configure(connection=connection, target_metadata=target_metadata)
    with context.begin_transaction():
        context.run_migrations()

async def run_async_migrations():
    connectable = async_engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,  # migrations use per-run connection, not pool
    )
    async with connectable.connect() as connection:
        await connection.run_sync(do_run_migrations)
    await connectable.dispose()

def run_migrations_online():
    asyncio.run(run_async_migrations())
```

```ini
# alembic.ini
[alembic]
script_location = migrations
# sqlalchemy.url is intentionally not set here; migrations/env.py sets it
# from the DATABASE_MIGRATE_URL environment variable.
```

**Two-DSN in practice:** `migrations/env.py` reads `DATABASE_MIGRATE_URL` from the environment and hands it to Alembic as `sqlalchemy.url`. `alembic.ini` cannot do this itself: its `%(name)s` interpolation resolves only keys defined in the file (plus `%(here)s`), not environment variables. FastAPI's `db/session.py` reads `DATABASE_URL`. Both are set in `.env`. The Docker Compose `backend` service has both env vars; the `celery-worker` service has `DATABASE_URL` only (workers need no DDL).

**Source:** [CITED: alembic.sqlalchemy.org/en/latest/cookbook.html#using-asyncio-with-alembic] + [CITED: github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py]

---

### Pattern 3: MinIO SDK Sync-in-Async via `asyncio.to_thread()`

**What:** The MinIO Python SDK is synchronous. In an async FastAPI context, blocking I/O blocks the event loop. Wrap MinIO SDK calls in `asyncio.to_thread()` to offload to a thread pool without blocking.

**When to use:** All MinIO operations (`put_object`, `get_object`, `bucket_exists`, `presigned_get_object`) called from `async def` handlers or services.

**Example:**

```python
# storage/minio_backend.py
import asyncio
import io
import uuid
from datetime import timedelta

from minio import Minio

from storage.base import StorageBackend

class MinIOBackend(StorageBackend):
    def __init__(self, endpoint: str, access_key: str, secret_key: str,
                 bucket: str, secure: bool = False):
        self._client = Minio(
            endpoint=endpoint,
            access_key=access_key,
            secret_key=secret_key,
            secure=secure,  # False for Docker internal network (HTTP)
        )
        self._bucket = bucket

    async def put_object(
        self,
        user_id: str,
        document_id: str,
        file_bytes: bytes,
        extension: str,
        content_type: str,
    ) -> str:
        object_key = f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"
        data = io.BytesIO(file_bytes)
        await asyncio.to_thread(
            self._client.put_object,
            self._bucket,
            object_key,
            data,
            length=len(file_bytes),
            content_type=content_type,
        )
        return object_key

    async def get_object(self, object_key: str) -> bytes:
        # Required by the StorageBackend ABC (Pattern 8).
        def _read() -> bytes:
            response = self._client.get_object(self._bucket, object_key)
            try:
                return response.read()
            finally:
                response.close()
                response.release_conn()
        return await asyncio.to_thread(_read)

    async def delete_object(self, object_key: str) -> None:
        # Required by the StorageBackend ABC (Pattern 8).
        await asyncio.to_thread(self._client.remove_object, self._bucket, object_key)

    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str:
        return await asyncio.to_thread(
            self._client.presigned_get_object,
            bucket_name=self._bucket,
            object_name=object_key,
            expires=timedelta(minutes=expires_minutes),
        )

    async def health_check(self) -> bool:
        try:
            return await asyncio.to_thread(
                self._client.bucket_exists, self._bucket
            )
        except Exception:
            return False
```

**MinIO `put_object` signature (confirmed):**

```python
client.put_object(
    bucket_name: str,
    object_name: str,        # the object key
    data: io.RawIOBase,      # io.BytesIO is accepted
    length: int,             # -1 with part_size for unknown-length streams
    content_type: str = "application/octet-stream",
)
```

**Note on `length=-1`:** For unknown-length streams, set `length=-1` and `part_size=10*1024*1024`. For in-memory `io.BytesIO`, always pass `length=len(bytes)`; this avoids a multipart upload when not needed.

**Source:** [CITED: github.com/minio/minio-py/blob/master/docs/API.md]

---

### Pattern 4: MinIO Bucket Initialization at Startup

**What:** On first `docker compose up`, MinIO starts with an empty state.
The application must create the `docuvault` bucket if it doesn't exist. This is done in the FastAPI lifespan, not in user request handlers.

**Example:**

```python
# main.py lifespan extension
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI
from minio import Minio

from config import settings
from db.session import engine

@asynccontextmanager
async def lifespan(app: FastAPI):
    # PostgreSQL engine + pool (created in db/session.py, see Pattern 1)

    # MinIO bucket initialization
    minio_client = Minio(
        settings.minio_endpoint,
        access_key=settings.minio_access_key,
        secret_key=settings.minio_secret_key,
        secure=False,
    )
    exists = await asyncio.to_thread(minio_client.bucket_exists, settings.minio_bucket)
    if not exists:
        await asyncio.to_thread(minio_client.make_bucket, settings.minio_bucket)
    app.state.minio = minio_client

    yield

    await engine.dispose()
```

---

### Pattern 5: Celery App + Redis Broker Configuration

**What:** A single `celery_app.py` module defines the Celery application. Tasks are defined as decorated functions. FastAPI route handlers call `.delay()` to enqueue; the celery-worker container processes them.

**Redis URL format (with password, Docker internal network):**

```
redis://:${REDIS_PASSWORD}@redis:6379/0
```

The `:` before the password with no username is the correct format when Redis is configured with `requirepass` but no ACL users.
[CITED: docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html via WebSearch]

**Example:**

```python
# celery_app.py
import os
from celery import Celery

# Broker and result backend share one Redis instance (D-10).
REDIS_URL = os.environ.get("REDIS_URL", "redis://redis:6379/0")

celery_app = Celery("docuvault")
celery_app.conf.broker_url = REDIS_URL
celery_app.conf.result_backend = REDIS_URL
celery_app.conf.task_serializer = "json"
celery_app.conf.result_serializer = "json"
celery_app.conf.accept_content = ["json"]
celery_app.conf.task_routes = {
    "tasks.document_tasks.*": {"queue": "documents"},
}

# tasks/document_tasks.py
from celery_app import celery_app

@celery_app.task(name="tasks.document_tasks.extract_and_classify")
def extract_and_classify(document_id: str) -> dict:
    # Celery tasks are SYNCHRONOUS functions: do NOT use async def here.
    # Use asyncio.run() sparingly or run sync equivalents of extractor/classifier.
    from services import extractor, classifier
    ...

# api/documents.py: calling the task
from tasks.document_tasks import extract_and_classify

@router.post("/upload")
async def upload_document(...):
    ...
    # Replace: background_tasks.add_task(classifier.classify_document, doc_id)
    # With:
    extract_and_classify.delay(str(saved_doc.id))
    return meta
```

**Critical: Celery tasks are synchronous.** Celery's worker calls each task as an ordinary blocking function; no asyncio event loop is running inside the worker. Calling `async def` functions inside a Celery task requires `asyncio.run()`, which creates a new event loop per task invocation. This is acceptable for Phase 1 since the existing `extractor.py` and `classifier.py` services already have sync and async entry points, but keep tasks pure-sync where possible. [VERIFIED via WebSearch cross-checked with official docs]

**Worker startup command:**

```
celery -A celery_app worker --loglevel=info -Q documents
```

---

### Pattern 6: Docker Compose Health Checks + `depends_on`

**What:** Each infrastructure service has a `healthcheck` definition.
The `backend` service uses `depends_on: condition: service_healthy` to wait for all three (postgres, minio, redis) before starting.

**Example:**

```yaml
services:
  postgres:
    image: postgres:17-alpine
    environment:
      POSTGRES_DB: docuvault
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./docker/postgres/initdb.d:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d docuvault"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data
    healthcheck:
      # curl is removed from recent MinIO images; use the /minio/health/live HTTP endpoint
      # from the host. Inside the container, mc is available:
      test: ["CMD", "mc", "ready", "local"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  backend:
    depends_on:
      postgres:
        condition: service_healthy
      minio:
        condition: service_healthy
      redis:
        condition: service_healthy
```

**MinIO healthcheck note:** `curl` was removed from MinIO's Docker image in October 2023. The `mc ready local` command is the current recommended healthcheck inside the container. The `/minio/health/live` HTTP endpoint (returns 200 OK) is still valid for external probing but cannot be used inside the container without curl. [CITED: github.com/minio/minio/issues/18389]

---

### Pattern 7: PostgreSQL Two-User Init Script

**What:** The official PostgreSQL Docker image runs scripts in `/docker-entrypoint-initdb.d/` on first start (empty volume). A SQL script provisions two users: `docuvault_migrate` (DDL) and `docuvault_app` (runtime, restricted).

**When to use:** First `docker compose up` with a fresh volume. Idempotency across re-runs is not required, because init scripts only run once.

**Example:**

```sql
-- docker/postgres/initdb.d/01-init-users.sql
-- Runs as the POSTGRES_USER superuser on first container start only.

-- Migration user: DDL privileges (CREATE TABLE, ALTER TABLE, CREATE INDEX)
CREATE USER docuvault_migrate WITH PASSWORD 'PLACEHOLDER_MIGRATE_PASSWORD';
GRANT ALL PRIVILEGES ON DATABASE docuvault TO docuvault_migrate;
-- PostgreSQL 15+ no longer lets every role create objects in schema public,
-- and the database-level grant above does not cover schemas.
GRANT USAGE, CREATE ON SCHEMA public TO docuvault_migrate;

-- App user: runtime DML only (SELECT, INSERT, UPDATE, DELETE)
CREATE USER docuvault_app WITH PASSWORD 'PLACEHOLDER_APP_PASSWORD';
GRANT CONNECT ON DATABASE docuvault TO docuvault_app;

-- Grant schema-level privileges AFTER migration user creates the schema
-- This must run after alembic upgrade head, OR grant in a second script.
-- Pattern: grant via a post-migration step or grant within the migration itself:
--   GRANT USAGE ON SCHEMA public TO docuvault_app;
--   GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO docuvault_app;
--   ALTER DEFAULT PRIVILEGES IN SCHEMA public
--     GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO docuvault_app;
```

**Important:** The `GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES` must be run AFTER Alembic has created the tables, because `ON ALL TABLES` applies only to existing tables. Use `ALTER DEFAULT PRIVILEGES` so future tables (from future migrations) are also accessible. This can be done at the end of the first Alembic migration file, or in a post-migration Docker entrypoint hook.

**Recommended approach for Phase 1:** Run the GRANT as the last step of the `0001_initial_schema.py` migration using `op.execute()` as the `docuvault_migrate` user (which owns the tables it creates and can therefore grant privileges on them).
[ASSUMED: no official doc confirms this is the standard Alembic pattern, but it follows from standard PostgreSQL privilege management]

---

### Pattern 8: StorageBackend ABC (Mirrors `ai/` Pattern)

**What:** `storage/base.py` defines `StorageBackend` as an abstract base class with the same structure as `ai/base.py`. `storage/__init__.py` provides a `get_storage_backend()` factory. `storage/minio_backend.py` is the Phase 1 implementation.

**Example:**

```python
# storage/base.py
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    async def put_object(
        self, user_id: str, document_id: str,
        file_bytes: bytes, extension: str, content_type: str,
    ) -> str:
        """Store object; return the object_key used."""

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes:
        """Retrieve object bytes by key."""

    @abstractmethod
    async def delete_object(self, object_key: str) -> None:
        """Delete object by key."""

    @abstractmethod
    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str:
        """Return a time-limited download URL."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Return True if backend is reachable."""

# storage/__init__.py
from config import settings
from storage.base import StorageBackend  # needed for the return annotation
from storage.minio_backend import MinIOBackend

def get_storage_backend() -> StorageBackend:
    return MinIOBackend(
        endpoint=settings.minio_endpoint,
        access_key=settings.minio_access_key,
        secret_key=settings.minio_secret_key,
        bucket=settings.minio_bucket,
        secure=False,
    )
```

---

### Anti-Patterns to Avoid

- **Sync SQLAlchemy in async context:** Using `create_engine()` instead of `create_async_engine()` in FastAPI will block the event loop on every database call. Use `create_async_engine` throughout.
- **Calling `await session.commit()` then accessing lazy-loaded attributes:** Always set `expire_on_commit=False` or explicitly refresh after commit.
- **Connecting Alembic using `DATABASE_URL` (restricted user):** The restricted `docuvault_app` user has no DDL privileges. Alembic migrations will fail with `permission denied` errors. Alembic must always use `DATABASE_MIGRATE_URL`.
- **Using `async def` for Celery task functions:** Celery workers do not run an asyncio event loop. Define tasks as `def`, not `async def`. Wrap any async calls with `asyncio.run()` if unavoidable, but prefer sync implementations in tasks.
- **Storing human-readable filename as MinIO object key:** Object keys must be UUID-based (`{user_id}/{document_id}/{uuid4()}{ext}`). Filenames are stored ONLY in the `documents.filename` DB column. Putting human filenames in the key enables path traversal and makes key prediction trivial.
- **Using `minio_client.bucket_exists()` inside async handlers without `asyncio.to_thread`:** The MinIO SDK is synchronous; calling it directly from `async def` will block the event loop.
- **Using a bare `redis-cli ping` healthcheck against password-protected Redis:** For Redis with `requirepass`, the healthcheck must pass `-a $REDIS_PASSWORD` to `redis-cli`. A bare `redis-cli ping` will return `NOAUTH` and be treated as unhealthy.

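
The Celery anti-pattern above can be illustrated without Celery installed. The sketch below shows a plain `def` task body that drives an async helper through `asyncio.run()`. `classify_async` is a hypothetical stand-in for an async classifier entry point, and the `@celery_app.task` decorator is deliberately omitted so the snippet runs on the standard library alone.

```python
import asyncio


async def classify_async(document_id: str) -> dict:
    """Hypothetical async entry point; stands in for real awaited I/O."""
    await asyncio.sleep(0)
    return {"document_id": document_id, "status": "classified"}


def extract_and_classify(document_id: str) -> dict:
    """Synchronous task body, shaped the way a Celery task would be."""
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop again before returning.
    return asyncio.run(classify_async(document_id))


if __name__ == "__main__":
    print(extract_and_classify("doc-123"))
```

Each call pays for creating and tearing down an event loop, which is why the list above prefers sync implementations inside tasks where they exist.
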

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Async PostgreSQL session management | Custom connection/context manager | SQLAlchemy `async_sessionmaker` + `Depends(get_db)` | Handles connection pooling, transaction boundaries, error cleanup, and the `expire_on_commit` edge case |
| Database schema migrations | Manual `CREATE TABLE` scripts in Python | Alembic | Manages migration history, rollbacks, auto-generation from ORM models, and multi-environment DSN configuration |
| MinIO object lifecycle | Custom S3-like HTTP client | `minio` Python SDK | Handles multipart uploads, signature v4, presigned URL expiry, retry logic, and connection pooling |
| Background task distribution | Thread pools or `asyncio.create_task()` | Celery + Redis | Cross-instance task distribution, retry on failure, task result storage |
| Docker service ordering | `sleep` commands in Compose entrypoints | `healthcheck` + `depends_on: condition: service_healthy` | Deterministic, declarative; `sleep` is a race condition |
| PostgreSQL privilege management | Per-table GRANT scripts written by hand | `ALTER DEFAULT PRIVILEGES` in Alembic migration | Future migrations automatically inherit privileges; hand-written grants go stale |

**Key insight:** The existing `filelock`-based `services/storage.py` uses at least 6 custom concurrency primitives to solve problems that PostgreSQL's transaction isolation and MinIO's atomic object operations solve at the infrastructure level. The rewrite simplifies the code while gaining correctness guarantees.

---

## Common Pitfalls

### Pitfall 1: `expire_on_commit=True` (the default) Causes `MissingGreenlet`

**What goes wrong:** After `await session.commit()`, accessing any ORM object attribute triggers a new SELECT query. In async context, if there is no active session scope, SQLAlchemy raises `sqlalchemy.exc.MissingGreenlet: greenlet_spawn has not been called`.

**Why it happens:** The default `Session.expire_on_commit=True` marks objects as "expired" post-commit. The next attribute access triggers a lazy load, which needs a sync greenlet context (not available in asyncio). **How to avoid:** Always set `expire_on_commit=False` in `async_sessionmaker`. [CITED: docs.sqlalchemy.org] **Warning signs:** `MissingGreenlet` in tracebacks after commit; attribute access on model instances outside `async with session` blocks. --- ### Pitfall 2: Alembic `env.py` Not Importing All Models **What goes wrong:** `alembic revision --autogenerate` generates an empty migration even though models were defined. **Why it happens:** Alembic's `target_metadata` must be set to `Base.metadata`, and all model modules must be imported BEFORE `target_metadata` is accessed in `env.py`. Python only knows about models that have been imported. **How to avoid:** In `migrations/env.py`, explicitly import all model modules: ```python from db import models # noqa: F401 — must import to register with Base.metadata target_metadata = models.Base.metadata ``` **Warning signs:** Empty `op.` blocks in generated migrations; tables not appearing in migration history. --- ### Pitfall 3: MinIO `put_object` Requires `io.BytesIO.seek(0)` Before Use **What goes wrong:** `put_object` reads 0 bytes if the `io.BytesIO` object's file pointer is at the end (e.g., after writing to it). **Why it happens:** `io.BytesIO.write()` advances the pointer to the end of the data. `put_object` starts reading from the current position. **How to avoid:** Always call `data.seek(0)` before passing a `BytesIO` to `put_object`. Or construct the `BytesIO` from the complete bytes directly: `io.BytesIO(file_bytes)` starts the pointer at 0. **Warning signs:** MinIO reports successful upload but object is 0 bytes; or `OSError: stream having not enough data`. 
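
The pointer behaviour behind Pitfall 3 can be reproduced with the standard library alone. In this sketch `read_for_upload` is a hypothetical stand-in for any consumer that reads from the stream's current position; it is not the MinIO SDK.

```python
import io


def read_for_upload(stream: io.BytesIO) -> bytes:
    """Hypothetical stand-in for an uploader: reads from the current position."""
    return stream.read()


payload = b"%PDF-1.7 example bytes"

# Written-then-passed buffer: the pointer sits at the end, so nothing is read.
written = io.BytesIO()
written.write(payload)
unread = read_for_upload(written)

# Rewinding first makes the full payload available.
written.seek(0)
rewound = read_for_upload(written)

# Constructing from bytes starts the pointer at 0, so no seek is needed.
direct = read_for_upload(io.BytesIO(payload))
```

Here `unread` is empty while `rewound` and `direct` both hold the full payload, which is why the storage layer should either rewind or build the buffer from the complete bytes.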
---

### Pitfall 4: PostgreSQL Init Script GRANT Timing

**What goes wrong:** The `docuvault_app` user gets `permission denied` on tables even after `GRANT ... ON ALL TABLES`.

**Why it happens:** `GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public` only applies to tables that exist at the time of the GRANT. Tables created by Alembic after the init script runs are not covered.

**How to avoid:** Run `ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO docuvault_app;` in the Alembic initial migration (as the `docuvault_migrate` user, which owns the tables). Issue it before the first `CREATE TABLE` in that migration, because default privileges apply only to objects created afterwards. This covers all future tables created by the same migration user.

**Warning signs:** The first `docker compose up` works; the second run after `alembic upgrade head` fails with PostgreSQL `permission denied for table` errors.

---

### Pitfall 5: Redis Healthcheck Without Authentication

**What goes wrong:** `redis-cli ping` returns `NOAUTH Authentication required` when Redis is started with `requirepass`. Docker Compose treats non-zero exit as unhealthy. Backend never starts.

**Why it happens:** `redis-cli ping` without `-a` doesn't pass the password.

**How to avoid:** Use `redis-cli -a ${REDIS_PASSWORD} ping` in the healthcheck `test` field. Note that this logs a warning about passing the password on the command line, which is acceptable for a healthcheck but not for production scripts.

**Warning signs:** `backend` service stuck at `Waiting for redis to be healthy`; `redis-cli ping` showing `NOAUTH` in container logs.

---

### Pitfall 6: MinIO `mc ready local` Healthcheck Not Available Without `mc`

**What goes wrong:** `mc` is present in the official `minio/minio` Docker image, so `mc ready local` works as a healthcheck there. A third-party or stripped MinIO image may not ship `mc`, in which case the healthcheck command cannot run and the service never becomes healthy.

**How to avoid:** Stick to the official `minio/minio` image. If a custom image is needed, probe the `/minio/health/live` HTTP endpoint from a sidecar or from the host; probing from inside the container is not an option when the image has no `curl`.
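
A host-side or sidecar probe of `/minio/health/live` could look like the sketch below. The function name `minio_is_live`, the injectable `opener` parameter, and the use of plain HTTP (no TLS) are assumptions made for illustration; only the endpoint path comes from the notes above.

```python
import urllib.request
from typing import Callable


def minio_is_live(
    endpoint: str,
    opener: Callable[..., object] = urllib.request.urlopen,
    timeout: float = 2.0,
) -> bool:
    """Return True if GET http://{endpoint}/minio/health/live answers 200.

    Assumes plain HTTP; `opener` is injectable so the probe can be tested
    without a running MinIO instance.
    """
    url = f"http://{endpoint}/minio/health/live"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, so refused
        # connections, timeouts, and non-2xx responses all land here.
        return False
```

Called as `minio_is_live("minio:9000")` from a container on the same Compose network, this returns `False` rather than raising when MinIO is not yet reachable, which suits a retry loop.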
---

### Pitfall 7: Celery Worker Cannot Import FastAPI App Module

**What goes wrong:** The Celery worker Docker container imports `celery_app.py`, which transitively imports the FastAPI app or lifespan, which tries to open database connections or access `app.state`.

**Why it happens:** Shared imports between the FastAPI app and Celery tasks create circular dependencies at module load time.

**How to avoid:** Keep `celery_app.py` minimal (Celery configuration only). Task functions in `tasks/` import services directly, not via `main.py` or any router. The Celery worker starts with `celery -A celery_app worker`; it never starts FastAPI.

---

## Code Examples

### Full v1 SQLAlchemy ORM Schema (Phase 1 Migration Target)

```python
# db/models.py
import uuid
from datetime import datetime, timezone

from sqlalchemy import (
    Boolean, BigInteger, ForeignKey, Index, String, Text, TIMESTAMP, UniqueConstraint, Integer
)
from sqlalchemy.dialects.postgresql import UUID, INET, JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
from sqlalchemy.sql import func


def now_utc():
    return datetime.now(timezone.utc)


class Base(DeclarativeBase):
    pass


class User(Base):
    __tablename__ = "users"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    handle: Mapped[str] = mapped_column(String, unique=True, nullable=False)
    email: Mapped[str] = mapped_column(String, unique=True, nullable=False)
    password_hash: Mapped[str] = mapped_column(Text, nullable=False)
    totp_secret: Mapped[str | None] = mapped_column(Text, nullable=True)
    totp_enabled: Mapped[bool] = mapped_column(Boolean, nullable=False, default=False)
    role: Mapped[str] = mapped_column(String, nullable=False, default="user")
    is_active: Mapped[bool] = mapped_column(Boolean, nullable=False, default=True)
    ai_provider: Mapped[str | None] = mapped_column(Text, nullable=True)
    ai_model: Mapped[str | None] = mapped_column(Text, nullable=True)
    default_storage_backend: Mapped[str] = mapped_column(String, nullable=False, default="minio")
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )


class Quota(Base):
    __tablename__ = "quotas"

    user_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), primary_key=True
    )
    limit_bytes: Mapped[int] = mapped_column(BigInteger, nullable=False, default=104857600)  # 100 MB
    used_bytes: Mapped[int] = mapped_column(BigInteger, nullable=False, default=0)


class RefreshToken(Base):
    __tablename__ = "refresh_tokens"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    user_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False
    )
    token_hash: Mapped[str] = mapped_column(Text, unique=True, nullable=False)
    expires_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
    revoked: Mapped[bool] = mapped_column(Boolean, nullable=False, default=False)
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )

    __table_args__ = (Index("ix_refresh_tokens_user_revoked", "user_id", "revoked"),)


class Folder(Base):
    __tablename__ = "folders"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    user_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False
    )
    parent_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("folders.id", ondelete="CASCADE"), nullable=True
    )
    name: Mapped[str] = mapped_column(Text, nullable=False)
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )

    __table_args__ = (UniqueConstraint("user_id", "parent_id", "name"),)


class Document(Base):
    __tablename__ = "documents"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    # user_id is NULLABLE in Phase 1 (D-03); Phase 2 migration adds NOT NULL
    user_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=True
    )
    folder_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("folders.id", ondelete="SET NULL"), nullable=True
    )
    filename: Mapped[str] = mapped_column(Text, nullable=False)  # original human-readable name
    object_key: Mapped[str] = mapped_column(Text, nullable=False)  # MinIO key: {user_id}/{doc_id}/{uuid4}{ext}
    content_type: Mapped[str] = mapped_column(Text, nullable=False)
    size_bytes: Mapped[int] = mapped_column(BigInteger, nullable=False, default=0)
    storage_backend: Mapped[str] = mapped_column(String, nullable=False, default="minio")
    extracted_text: Mapped[str | None] = mapped_column(Text, nullable=True)
    status: Mapped[str] = mapped_column(String, nullable=False, default="pending")
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )
    updated_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )

    __table_args__ = (
        Index("ix_documents_user_folder", "user_id", "folder_id"),
        Index("ix_documents_user_created", "user_id", "created_at"),
    )


class Topic(Base):
    __tablename__ = "topics"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    user_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=True
    )
    name: Mapped[str] = mapped_column(Text, nullable=False)
    description: Mapped[str] = mapped_column(Text, nullable=False, default="")
    color: Mapped[str] = mapped_column(String(7), nullable=False, default="#6366f1")

    __table_args__ = (UniqueConstraint("user_id", "name"),)


class DocumentTopic(Base):
    __tablename__ = "document_topics"

    document_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("documents.id", ondelete="CASCADE"), primary_key=True
    )
    topic_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("topics.id", ondelete="CASCADE"), primary_key=True
    )


class Share(Base):
    __tablename__ = "shares"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    document_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("documents.id", ondelete="CASCADE"), nullable=False
    )
    owner_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False
    )
    recipient_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False
    )
    permission: Mapped[str] = mapped_column(String, nullable=False, default="view")
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )

    __table_args__ = (
        UniqueConstraint("document_id", "recipient_id"),
        Index("ix_shares_recipient", "recipient_id"),
    )


class AuditLog(Base):
    __tablename__ = "audit_log"

    id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
    user_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="SET NULL"), nullable=True
    )
    actor_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="SET NULL"), nullable=True
    )
    event_type: Mapped[str] = mapped_column(Text, nullable=False)
    resource_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), nullable=True)
    ip_address: Mapped[str | None] = mapped_column(INET, nullable=True)
    # "metadata" is a reserved attribute name on Declarative classes, so the
    # Python attribute is renamed while the DB column keeps the name "metadata".
    event_metadata: Mapped[dict | None] = mapped_column("metadata", JSONB, nullable=True)
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )

    __table_args__ = (
        Index("ix_audit_user_created", "user_id", "created_at"),
        Index("ix_audit_event_created", "event_type", "created_at"),
    )


class CloudConnection(Base):
    __tablename__ = "cloud_connections"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    user_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False
    )
    provider: Mapped[str] = mapped_column(String, nullable=False)
    display_name: Mapped[str] = mapped_column(Text, nullable=False)
    credentials_enc: Mapped[str] = mapped_column(Text, nullable=False)
    status: Mapped[str] = mapped_column(String, nullable=False, default="ACTIVE")
    connected_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )

    __table_args__ = (Index("ix_cloud_connections_user", "user_id"),)


class Group(Base):
    """v2 stub: empty table, seeded for schema completeness (PROJECT.md)."""

    __tablename__ = "groups"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name: Mapped[str] = mapped_column(Text, unique=True, nullable=False)
    created_at: Mapped[datetime] = mapped_column(
        TIMESTAMP(timezone=True), nullable=False, server_default=func.now()
    )
```

---

### Config Extension for New Env Vars

```python
# config.py (extended)
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # pydantic-settings (v2) configuration; replaces the older inner `class Config`
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # Existing
    data_dir: str = "/app/data"

    # Phase 1 additions
    database_url: str = "postgresql+psycopg://docuvault_app:changeme@postgres/docuvault"
    database_migrate_url: str = "postgresql+psycopg://docuvault_migrate:changeme@postgres/docuvault"
    minio_endpoint: str = "minio:9000"
    minio_access_key: str = "docuvault_app"
    minio_secret_key: str = "changeme"
    minio_bucket: str = "docuvault"
    redis_url: str = "redis://:changeme@redis:6379/0"
    secret_key: str = "CHANGEME"  # documented for Phase 2; not read in Phase 1


settings = Settings()
```

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `asyncpg` as the only async PostgreSQL dialect | `psycopg` v3 supports both sync + async via one package | 2022 (psycopg v3 release) | Single driver for Alembic + FastAPI; no separate sync/async packages |
| `alembic init` (sync template) | `alembic init -t async` for async engine migrations | Alembic 1.7+ | env.py template pre-configured for asyncio; no manual async wiring |
| `sessionmaker` with `class_=AsyncSession` stood in for an async session factory | `async_sessionmaker` is a first-class API in SQLAlchemy 2.0 | SQLAlchemy 2.0 (2023) | Cleaner factory pattern without subclassing |
| MinIO Docker image included `curl` for healthchecks | `curl` removed from image; `mc ready local` is the new healthcheck | October 2023 | Existing tutorials with `curl -f` healthcheck will silently fail on current images |
| `FastAPI BackgroundTasks` for async post-request work | Celery + Redis for distributed, reliable task queues | Ongoing | `BackgroundTasks` is per-instance and has no retry; Celery is cross-instance |

**Deprecated/outdated:**

- `filelock` dependency: can be removed from `backend/requirements.txt` once `services/storage.py` is replaced (CONCERNS.md item 14 identifies the unused `shutil` import; same cleanup applies to `filelock`).
- Per-document `.lock` files in `data/metadata/`: deleted with `data/` directory contents (D-04).
- `psycopg2` (old driver): not installed and not needed; `psycopg` v3 is the replacement.
- Sync file I/O in async handlers (CONCERNS.md item 6): resolved entirely by switching to async SQLAlchemy.

---

## Assumptions Log

| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | Running `GRANT ... ON ALL TABLES` inside the Alembic initial migration as `docuvault_migrate` is the standard pattern for privilege handoff to `docuvault_app` | Pattern 7 (PostgreSQL init script) | If the migration user lacks permission to GRANT to another user, privileges must be set manually or via a separate script, which delays testing |
| A2 | The Celery worker container can import `db/models.py` and `services/` directly without starting FastAPI (no circular import) | Pattern 5 (Celery) | If service modules import FastAPI components at module level, a refactor is needed before worker tasks can import services |
| A3 | `minio/minio:latest` Docker image includes `mc` for the `mc ready local` healthcheck | Pattern 6 (Docker Compose) | If `mc` is not in the image, healthcheck must use a shell-based TCP probe or alternative; confirmed via GitHub issue discussion [CITED: github.com/minio/minio/issues/18389] but version-specific |

---

## Open Questions

1. **PostgreSQL version to pin in Docker Compose**
   - What we know: Any PostgreSQL 14+ supports `gen_random_uuid()`, `JSONB`, `INET`, and `TIMESTAMPTZ` used in the schema.
   - What's unclear: Whether to use `postgres:16`, `postgres:17`, or `postgres:17-alpine`.
   - Recommendation: Use `postgres:17-alpine` (smallest image, current stable, alpine is well-suited for Docker Compose dev setups).

2. **MinIO version pinning**
   - What we know: `minio/minio:latest` has `mc` available for healthchecks; `curl` was removed in late 2023.
   - What's unclear: Whether to pin to a specific release tag (e.g., `RELEASE.2025-09-07T16-13-09Z`) or use `:latest`.
   - Recommendation: Pin to a specific RELEASE tag for reproducibility; update as part of a maintenance task. [ASSUMED: no strong official guidance on whether `:latest` is appropriate for production-adjacent Docker Compose]

3. **Topics table migration: existing topic names from `data/topics.json`**
   - What we know: D-04 deletes `data/` contents. Topics stored in `topics.json` are test data and are deleted.
   - What's unclear: The existing `api/topics.py` and `frontend/src/stores/topics.js` need updating to read from PostgreSQL instead of the flat file. The API shape should remain the same (list of objects with `id`, `name`, `description`, `color`).
   - Recommendation: The planner must include a task for updating `api/topics.py` to use async SQLAlchemy ORM queries against the `topics` table.

4. **Celery task vs direct service call for text extraction + classification**
   - What we know: The current `api/documents.py` calls `await classifier.classify_document()` inside the route handler. This needs to move to a Celery task.
   - What's unclear: Whether Phase 1 should move ALL of extraction + classification into a Celery task (full async flow) or just wire up the infrastructure with a placeholder task and migrate the logic in Phase 3.
   - Recommendation: Phase 1 should wire the full task (extract + classify) in Celery, because the walking skeleton requirement says "AI classification workflow completes successfully." A placeholder task that doesn't classify would fail the success criteria.

---

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Docker | Docker Compose services | ✓ | 29.5.0 | - |
| Python 3.12 | Backend (in Docker image) | ✓ (host: 3.14.5; Docker: 3.12 pinned) | 3.12 in image | - |
| PostgreSQL (via Docker) | Database tier | ✓ (via Docker) | 17 (image) | - |
| MinIO (via Docker) | Object storage | ✓ (via Docker) | latest | - |
| Redis (via Docker) | Celery broker, Phase 2 rate limiting | ✓ (via Docker) | 7-alpine | - |
| pytest | Backend test runner | ✓ (host pip3) | existing | - |

**Missing dependencies with no fallback:** None.

**Missing dependencies with fallback:** None.
---

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | pytest with pytest-asyncio (existing) |
| Config file | `backend/pytest.ini` (existing; `asyncio_mode = auto`) |
| Quick run command | `cd backend && pytest tests/test_health.py tests/test_documents.py tests/test_storage.py -x` |
| Full suite command | `cd backend && pytest -v` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| STORE-01 | Upload stores metadata in PostgreSQL and bytes in MinIO | integration | `pytest tests/test_documents.py::test_upload_stores_to_postgres_and_minio -x` | ❌ Wave 0 |
| STORE-01 | List documents reads from PostgreSQL (not filesystem) | integration | `pytest tests/test_documents.py::test_list_reads_from_db -x` | ❌ Wave 0 |
| STORE-02 | MinIO object key matches `{user_id}/{document_id}/{uuid4}{ext}` pattern | unit | `pytest tests/test_storage.py::test_object_key_schema -x` | ❌ Wave 0 |
| STORE-02 | Human-readable filename is NOT in the object key | unit | `pytest tests/test_storage.py::test_filename_not_in_object_key -x` | ❌ Wave 0 |
| STORE-07 | `/health` returns PostgreSQL + MinIO connectivity (not just `{"status": "ok"}`) | smoke | `pytest tests/test_health.py::test_health_checks_postgres_and_minio -x` | ❌ Wave 0 |
| STORE-07 (implicit) | Storage service has no file locks; concurrent uploads do not corrupt state | integration | `pytest tests/test_documents.py::test_concurrent_uploads -x` | ❌ Wave 0 |

### Sampling Rate

- **Per task commit:** `cd backend && pytest tests/test_health.py tests/test_storage.py -x`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `tests/test_storage.py`: covers STORE-02 (object key schema, filename isolation)
- [ ] `tests/test_documents.py`: extend for PostgreSQL/MinIO-backed upload/list (STORE-01)
- [ ] `tests/test_health.py`: extend for PostgreSQL + MinIO connectivity probes (STORE-07)
- [ ] `tests/conftest.py`: add async engine + session fixtures; add MinIO mock or test bucket fixture
- [ ] Update `tests/conftest.py` to monkeypatch `db/session.py` paths (not just `config.py` paths)

**Existing tests:** `test_documents.py`, `test_topics.py`, `test_settings.py` test the OLD flat-file storage layer. They will break after `services/storage.py` is replaced. These must be ported (not deleted) as part of Phase 1.

---

## Security Domain

### Applicable ASVS Categories

| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | No (Phase 1 has no auth) | Phase 2 |
| V3 Session Management | No (Phase 1 has no sessions) | Phase 2 |
| V4 Access Control | Partial (object key isolation in MinIO backend) | `user_id` prefix enforced in `MinIOBackend.put_object()` |
| V5 Input Validation | Yes (file upload content type + size) | `ALLOWED_MIME_TYPES` allowlist (defined but currently unenforced per CONCERNS.md item 1) |
| V6 Cryptography | No (Phase 1 has no credential encryption) | Phase 5 |

### Known Threat Patterns for This Phase

| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Object key prediction / path traversal | Tampering | UUID-based object keys (`{user_id}/{document_id}/{uuid4}{ext}`); never accept object keys from request parameters |
| Database superuser credentials in app DSN | Elevation of Privilege | Two-DSN pattern: `docuvault_app` (restricted) for runtime, `docuvault_migrate` (DDL) for Alembic only |
| MinIO credentials with bucket admin rights | Elevation of Privilege | App-level access key pair (`MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY`) with read/write on `docuvault` bucket only; root credentials not used by app |
| Redis unauthenticated in Docker network | Information Disclosure | `requirepass` set on Redis; `REDIS_URL` includes password; Celery broker and app use authenticated URL |
| SQL injection via ORM | Tampering | SQLAlchemy ORM / parameterized queries throughout; zero raw string interpolation (matches CLAUDE.md SEC-03) |
| Sensitive data in MinIO object key | Information Disclosure | Human-readable filenames stored in DB only; object key is UUID-based and non-predictable |

---

## Sources

### Primary (HIGH confidence)

- [docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html](https://docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html): async engine setup, `async_sessionmaker`, `expire_on_commit=False`, FastAPI lifespan integration
- [alembic.sqlalchemy.org/en/latest/cookbook.html#using-asyncio-with-alembic](https://alembic.sqlalchemy.org/en/latest/cookbook.html): async `env.py` pattern
- [github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py](https://github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py): official async env.py template code
- [github.com/minio/minio-py/blob/master/docs/API.md](https://github.com/minio/minio-py/blob/master/docs/API.md): `put_object`, `presigned_get_object`, constructor signatures
- [github.com/minio/minio/issues/18389](https://github.com/minio/minio/issues/18389): `curl` removal from MinIO image; `mc ready local` as replacement
- [docs.min.io/enterprise/aistor-object-store/operations/monitoring/healthcheck-probe/](https://docs.min.io/enterprise/aistor-object-store/operations/monitoring/healthcheck-probe/): `/minio/health/live` endpoint documented
- [docs.docker.com/reference/compose-file/services/#healthcheck](https://docs.docker.com/reference/compose-file/services/#healthcheck): `healthcheck` + `depends_on: condition: service_healthy` syntax

### Secondary (MEDIUM confidence)

- [docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html](https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html): Redis URL format verified via WebSearch; Celery docs site was unreachable during research session
- [testdriven.io/blog/fastapi-and-celery/](https://testdriven.io/blog/fastapi-and-celery/): Celery + FastAPI project structure and `.delay()` pattern
- WebSearch results cross-referenced with official docs for psycopg install extras, Redis broker URL format, PostgreSQL init script pattern

### Tertiary (LOW confidence)

- None: all key claims cross-verified with at least one authoritative source

---

## Metadata

**Confidence breakdown:**

- Standard stack: HIGH. All packages verified on PyPI via `pip3 index versions`, slopcheck [OK] for all 6 core packages
- Architecture: HIGH. Patterns drawn from SQLAlchemy official docs, Alembic official template, and MinIO official GitHub
- Pitfalls: HIGH. Each pitfall sourced from official documentation or confirmed GitHub issues (not community blog posts only)
- Celery configuration: MEDIUM. Celery docs site was unreachable; URL format cross-verified via WebSearch + community sources

**Research date:** 2026-05-21

**Valid until:** 2026-06-21 for stable stack; MinIO healthcheck pattern should be re-verified if the Docker image version changes significantly