# Phase 1: Infrastructure Foundation - Research
**Researched:** 2026-05-21
**Domain:** PostgreSQL + MinIO + Redis + Celery wired into FastAPI via Docker Compose; Alembic async migrations; storage service rewrite
**Confidence:** HIGH
---
## User Constraints (from CONTEXT.md)
### Locked Decisions
**Schema Scope**
- D-01: Phase 1 initial Alembic migration creates the full v1 skeleton — all tables: `users`, `refresh_tokens`, `quotas`, `documents`, `topics`, `folders`, `shares`, `audit_log`, `cloud_connections`. Subsequent phases add data and constraints, not new tables.
- D-02: `groups` table stub included in Phase 1 migration (v2 feature; empty table, correct columns and FKs).
- D-03: `documents.user_id` is nullable in Phase 1 (no auth system yet). Phase 2 migration adds the NOT NULL constraint after the user/auth system is live.
- D-04: Existing `data/` directory contents (flat-file JSON metadata + uploaded files) are deleted in Phase 1. Test data only — no migration script needed.
**App Wiring**
- D-05: Phase 1 switches the storage service layer to PostgreSQL + MinIO. `backend/services/storage.py` is rewritten to use async SQLAlchemy + MinIO SDK. The app does not continue using the filesystem after Phase 1.
- D-06: Single MinIO bucket named `docuvault`. Object keys follow `{user_id}/{document_id}/{uuid4()}{ext}` (STORE-02). Human-readable filenames stored in the `documents.filename` DB column only — never in the MinIO key.
- D-07: `backend/main.py` `/health` endpoint extended to check PostgreSQL + MinIO connectivity (not just `{"status": "ok"}`). Health checks gate `docker compose up` readiness.
**Background Worker**
- D-08: Background task queue: Celery + Redis (STORE-08). FastAPI `BackgroundTasks` replaced.
- D-09: Redis service added to `docker-compose.yml` in Phase 1. Redis doubles as the rate-limiting store for Phase 2 auth endpoints — no second Redis needed later.
- D-10: A `celery-worker` service is added to `docker-compose.yml`. Celery broker and result backend both point to the same Redis instance via `REDIS_URL`.
**Env / Secrets Strategy**
- D-11: `.env` gitignored + `.env.example` committed. `docker-compose.yml` reads vars via `${VAR_NAME}`. `.env.example` has safe placeholder values and comments explaining each variable.
- D-12: Production secrets stored outside the project directory at `/etc/docuvault/env` (`chmod 600`, owned by the service user, not root). `docker-compose.yml` references it via `env_file:`. Documented in deployment notes.
- D-13: Two PostgreSQL DSNs: `DATABASE_URL` (restricted app user `docuvault_app`, SELECT/INSERT/UPDATE/DELETE only; no DDL) and `DATABASE_MIGRATE_URL` (migration user `docuvault_migrate`, DDL privileges; used only by Alembic).
- D-14: PostgreSQL init script in `docker/postgres/initdb.d/` provisions both users on first container start. The app never connects as the PostgreSQL superuser.
- D-15: MinIO vars: `MINIO_ENDPOINT`, `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD` (init only), `MINIO_BUCKET` (value: `docuvault`), `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY` (separate app-level access key pair with minimal bucket permissions).
- D-16: Additional vars in Phase 1 `.env.example`: `REDIS_URL`, `SECRET_KEY` (documented now for Phase 2 JWT + HKDF use; app does not read it in Phase 1).
### Claude's Discretion
None — user made explicit choices for all areas.
### Deferred Ideas (OUT OF SCOPE)
None — discussion stayed within phase scope.
---
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| STORE-01 | Platform storage layer migrated from flat-file JSON + local filesystem to PostgreSQL (metadata) + MinIO (objects) | SQLAlchemy 2.0 async ORM + MinIO SDK patterns documented; service rewrite approach confirmed |
| STORE-02 | Each user's MinIO objects use `{user_id}/{document_id}/{uuid4()}{ext}` keys — human-readable filenames stored in DB only | MinIO `put_object()` API confirmed; key schema enforced in model/service layer |
| STORE-07 | Backend is stateless — no per-instance file locks; multiple instances can run behind a load balancer | PostgreSQL atomic UPDATE + Celery + Redis replaces filelock pattern; verified |
---
## Summary
Phase 1 replaces the entire flat-file persistence layer (JSON metadata + local filesystem uploads) with PostgreSQL (via SQLAlchemy 2.0 async ORM) + MinIO (via the official Python SDK) wired into Docker Compose. Redis and a Celery worker are added alongside as the background task queue that replaces FastAPI `BackgroundTasks`, delivering statelessness required by STORE-07. All infrastructure services are health-checked and ordered via `depends_on` conditions so `docker compose up` can be treated as the single operational command. Alembic manages the schema using the async migration template with a two-DSN strategy (restricted app user + DDL migration user). The walking skeleton requirement is satisfied by: the full v1 schema applied via Alembic, one real document upload persisted to PostgreSQL and MinIO through the rewritten storage service, and the `/health` endpoint returning live connectivity checks for all three services.
The existing single-user document upload → text extraction → AI classification workflow continues to work end-to-end after Phase 1. The Vue frontend requires no changes. All API routes and response shapes are preserved.
**Primary recommendation:** Wire infrastructure with Docker Compose health checks first; apply Alembic migration second; rewrite `services/storage.py` third; replace `BackgroundTasks` with Celery tasks last. This ordering allows each layer to be verified before the next is built.
---
## Architectural Responsibility Map
| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| Document metadata persistence | Database / Storage (PostgreSQL) | API / Backend | All metadata is authored and read server-side; no client involvement |
| Binary file storage | Database / Storage (MinIO) | API / Backend | Object store owns bytes; backend generates keys and proxies operations |
| Background text extraction + classification | Background Worker (Celery) | API / Backend | CPU-intensive, deferred; must not block HTTP event loop |
| Health checking | API / Backend | Docker Compose | FastAPI `/health` probes PostgreSQL + MinIO; Compose waits on it |
| Schema migrations | Database / Storage (Alembic + PostgreSQL) | — | DDL-only responsibility; executed before app starts |
| Object key namespacing | API / Backend (service layer) | — | Key construction is a code concern, not a storage concern |
| Service ordering / startup sequencing | CDN / Static (Docker Compose) | — | `depends_on: condition: service_healthy` enforces boot order |
| Connection pooling | API / Backend (SQLAlchemy pool) | Database / Storage | App holds pool; PostgreSQL is the pooled resource |
| Task queue / broker | Background Worker (Redis / Celery) | API / Backend | Broker is Redis; workers are separate Docker Compose services |
---
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `sqlalchemy[asyncio]` | `>=2.0.49` | ORM + async engine + connection pool | Industry standard for Python async PostgreSQL; `create_async_engine` + `async_sessionmaker` pattern is the canonical FastAPI integration |
| `psycopg[binary]` | `>=3.3.4` | PostgreSQL async driver | psycopg v3 (`psycopg`) is SQLAlchemy 2.0's preferred async dialect; `[binary]` provides pre-built wheels with no system dependency on libpq headers |
| `alembic` | `>=1.18.4` | Database migrations | The only maintained migration tool for SQLAlchemy; provides async template (`alembic init -t async`) |
| `minio` | `>=7.2.20` | MinIO / S3 object storage SDK | Official MinIO Python SDK; stable API for `put_object`, `get_object`, `bucket_exists`, `presigned_get_object` |
| `celery[redis]` | `>=5.6.3` | Background task queue + Redis transport | Battle-tested distributed task queue; `[redis]` extra installs `redis` client; replaces per-instance `BackgroundTasks` |
| `redis` | `>=7.4.0` | Redis Python client (Celery dependency + Phase 2 rate limiting) | Official Redis client; installed transitively by `celery[redis]` but worth pinning for Phase 2 rate limiting use |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `pydantic-settings` | `>=2.2` | Env var configuration (already in project) | Extended with new DATABASE_URL, MINIO_*, REDIS_URL vars |
| `anyio` | `>=4.13.0` | Async testing utilities | Required by `httpx` for async test transport in pytest |
| `httpx` | `>=0.28.1` | Async HTTP client for integration tests | Needed to replace `TestClient` (sync) with `AsyncClient` for async route testing |
| `pytest-asyncio` | `>=1.3.0` | Async test runner integration | Already in project as `>=0.23`; upgrade to `>=1.3.0` for `asyncio_mode = auto` support in new async tests |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `psycopg[binary]` | `asyncpg` | `asyncpg` is faster in benchmarks but requires a separate sync driver (`psycopg2`) for Alembic. `psycopg` v3 works for both sync (Alembic) and async (FastAPI) with the same URL — zero driver switching |
| `celery[redis]` | `pgqueuer` / `pg_boss` | pgqueuer uses PostgreSQL as the queue (no Redis required). However, the user explicitly selected Celery + Redis. Redis is also needed in Phase 2 for rate limiting, so Redis is justified regardless |
| `minio` Python SDK (sync, wrapped in `asyncio.to_thread`) | `aiobotocore` | MinIO SDK is the official client with full API coverage including MinIO-specific features. `aiobotocore` is AWS-oriented and less tested with MinIO-specific APIs. `to_thread()` wrapping is the correct async pattern for the sync SDK |
**Installation (backend/requirements.txt additions):**
```
sqlalchemy[asyncio]>=2.0
psycopg[binary]>=3.3
alembic>=1.13
minio>=7.2
celery[redis]>=5.4
redis>=7.0
httpx>=0.27
pytest-asyncio>=0.23
```
Note: `psycopg[binary]` is specified with bracket extras in requirements.txt. The binary extra installs a self-contained wheel — no system `libpq-dev` package required in the Docker image, simplifying the Dockerfile.
---
## Package Legitimacy Audit
All packages verified on PyPI registry via `pip3 index versions` and `slopcheck install` (v0.6.1, run 2026-05-21).
| Package | Registry | Age | Downloads | Source Repo | slopcheck | Disposition |
|---------|----------|-----|-----------|-------------|-----------|-------------|
| `sqlalchemy` | PyPI | ~20 yrs | Very high (millions/wk) | github.com/sqlalchemy/sqlalchemy | OK | Approved |
| `psycopg` | PyPI | ~4 yrs (v3) | High | github.com/psycopg/psycopg | OK | Approved |
| `alembic` | PyPI | ~12 yrs | Very high | github.com/sqlalchemy/alembic | OK | Approved |
| `minio` | PyPI | ~8 yrs | High | github.com/minio/minio-py | OK | Approved |
| `celery` | PyPI | ~15 yrs | Very high (millions/wk) | github.com/celery/celery | OK | Approved |
| `redis` | PyPI | ~12 yrs | Very high | github.com/redis/redis-py | OK | Approved |
**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none
Note: `psycopg[binary]` is specified with extras syntax in requirements.txt; the installable wheel is `psycopg-binary` on PyPI, which also passed registry verification (version 3.3.4 confirmed). [VERIFIED: PyPI registry + slopcheck OK]
---
## Architecture Patterns
### System Architecture Diagram
```
Browser (Vue 3 SPA — unchanged in Phase 1)
│ HTTP/JSON + multipart (same API contract)
▼
FastAPI (port 8000) — lifespan creates async engine, disposes on shutdown
│
├── api/documents.py ─── calls ──► services/storage.py (REWRITTEN)
│ │
│ ├─► db/session.py (AsyncSession)
│ │ │
│ │ ▼
│ │ PostgreSQL (port 5432)
│ │ [docuvault_app user, restricted]
│ │
│ └─► storage/minio_backend.py
│ │
│ ▼
│ MinIO (port 9000)
│ [bucket: docuvault]
│ [app-level access key]
│
├── /health ─── probes ──► PostgreSQL + MinIO connectivity
│
└── celery_app.py ─── enqueues tasks ──► Redis (port 6379)
│
Celery Worker (separate container)
├── task: extract_and_classify()
│ ├─► services/extractor.py
│ └─► services/classifier.py
└── consumes from Redis queue
Alembic (run once at deploy time, not part of app startup)
│ uses DATABASE_MIGRATE_URL (docuvault_migrate user, DDL privileges)
└─► PostgreSQL — applies full v1 schema
```
### Recommended Project Structure
```
backend/
├── main.py # FastAPI app; extend lifespan for engine/dispose
├── config.py # pydantic-settings: extend with new env vars
├── celery_app.py # Celery app instance (broker from REDIS_URL)
├── db/
│ ├── __init__.py
│ ├── session.py # async engine + async_sessionmaker
│ └── models.py # all SQLAlchemy ORM models (full v1 schema)
├── deps/
│ └── db.py # get_db() — yields AsyncSession
├── services/
│ ├── storage.py # REPLACED: async SQLAlchemy + MinIO SDK
│ ├── extractor.py # unchanged
│ └── classifier.py # update to accept session; dispatch via Celery
├── storage/ # NEW: StorageBackend ABC + MinIO implementation
│ ├── __init__.py # get_storage_backend() factory
│ ├── base.py # StorageBackend ABC (mirrors ai/base.py)
│ └── minio_backend.py # MinIO implementation
├── tasks/
│ └── document_tasks.py # Celery task definitions (extract_and_classify)
├── migrations/ # Alembic migration directory
│ ├── env.py # async env.py with two-DSN strategy
│ ├── script.py.mako
│ └── versions/
│ └── 0001_initial_schema.py
├── alembic.ini # sqlalchemy.url = DATABASE_MIGRATE_URL
├── api/
│ ├── documents.py # update to use async storage service
│ ├── topics.py # unchanged (topics still in DB after migration)
│ └── settings.py # unchanged
└── tests/
├── conftest.py # UPDATE: add async engine + session fixtures
├── test_health.py # UPDATE: test PostgreSQL + MinIO health probes
├── test_documents.py # UPDATE: adapt for async storage layer
└── test_storage.py # NEW: unit tests for MinIO object key schema
```
### Pattern 1: SQLAlchemy 2.0 Async Engine + Session Factory (FastAPI Lifespan)
**What:** Create engine once at startup, share it application-wide via `app.state`. Session factory (`async_sessionmaker`) yields per-request sessions via a FastAPI dependency.
**When to use:** Any database access in FastAPI route handlers or services.
**Example:**
```python
# db/session.py
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from config import settings
engine = create_async_engine(
settings.database_url, # postgresql+psycopg://docuvault_app:...@postgres/docuvault
pool_pre_ping=True, # detect stale connections before use
echo=False,
)
AsyncSessionLocal = async_sessionmaker(
engine,
class_=AsyncSession,
expire_on_commit=False, # prevent lazy-load errors after commit
)
# deps/db.py
from db.session import AsyncSessionLocal
async def get_db():
async with AsyncSessionLocal() as session:
try:
yield session
finally:
await session.close()
# main.py — lifespan
from contextlib import asynccontextmanager
from db.session import engine
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup: engine creates pool on first connection
yield
# Shutdown: close all pooled connections
await engine.dispose()
app = FastAPI(lifespan=lifespan)
```
**Source:** [CITED: docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html]
**Key detail — URL format for psycopg v3:**
```
postgresql+psycopg://user:password@host:port/dbname
```
The same `postgresql+psycopg://` prefix works for both `create_engine()` (Alembic) and `create_async_engine()` (FastAPI). SQLAlchemy selects the sync or async dialect variant automatically. [CITED: docs.sqlalchemy.org/en/20/dialects/postgresql.html]
**Key detail — `expire_on_commit=False`:** After `session.commit()`, SQLAlchemy marks all objects as expired and would trigger another SELECT on next attribute access. In async context, this causes `MissingGreenlet` errors because there's no active async context at that point. Setting `expire_on_commit=False` prevents this. [CITED: docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html]
---
### Pattern 2: Alembic Async Configuration with Two DSNs
**What:** Alembic's async template (`alembic init -t async`) generates `env.py` that uses `async_engine_from_config` and `asyncio.run()`. The `DATABASE_MIGRATE_URL` DSN (DDL privileges) is used only by Alembic; the app uses `DATABASE_URL` (restricted). This separates migration risk from runtime risk.
**When to use:** Every `alembic upgrade head` call. Never used by FastAPI directly.
**Example:**
```python
# migrations/env.py (key section — async online migrations)
import asyncio
from sqlalchemy.ext.asyncio import async_engine_from_config
from sqlalchemy import pool
from alembic import context
from db.models import Base # import all models so metadata is populated
target_metadata = Base.metadata
def do_run_migrations(connection):
context.configure(connection=connection, target_metadata=target_metadata)
with context.begin_transaction():
context.run_migrations()
async def run_async_migrations():
connectable = async_engine_from_config(
config.get_section(config.config_ini_section, {}),
prefix="sqlalchemy.",
poolclass=pool.NullPool, # migrations use per-run connection, not pool
)
async with connectable.connect() as connection:
await connection.run_sync(do_run_migrations)
await connectable.dispose()
def run_migrations_online():
asyncio.run(run_async_migrations())
```
```ini
# alembic.ini
[alembic]
script_location = migrations
sqlalchemy.url = %(DATABASE_MIGRATE_URL)s # reads from env via %(VAR)s interpolation
```
**Two-DSN in practice:** The `alembic.ini` `sqlalchemy.url` references `DATABASE_MIGRATE_URL`. FastAPI's `db/session.py` reads `DATABASE_URL`. Both are set in `.env`. The Docker Compose `backend` service has both env vars; the `celery-worker` service has `DATABASE_URL` only (workers need no DDL).
**Source:** [CITED: alembic.sqlalchemy.org/en/latest/cookbook.html#using-asyncio-with-alembic] + [CITED: github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py]
---
### Pattern 3: MinIO SDK Sync-in-Async via `asyncio.to_thread()`
**What:** The MinIO Python SDK is synchronous. In an async FastAPI context, blocking I/O blocks the event loop. Wrap MinIO SDK calls in `asyncio.to_thread()` to offload to a thread pool without blocking.
**When to use:** All MinIO operations (`put_object`, `get_object`, `bucket_exists`, `presigned_get_object`) called from `async def` handlers or services.
**Example:**
```python
# storage/minio_backend.py
import asyncio
import io
import uuid
from datetime import timedelta
from minio import Minio
from storage.base import StorageBackend
class MinIOBackend(StorageBackend):
def __init__(self, endpoint: str, access_key: str, secret_key: str,
bucket: str, secure: bool = False):
self._client = Minio(
endpoint=endpoint,
access_key=access_key,
secret_key=secret_key,
secure=secure, # False for Docker internal network (HTTP)
)
self._bucket = bucket
async def put_object(
self,
user_id: str,
document_id: str,
file_bytes: bytes,
extension: str,
content_type: str,
) -> str:
object_key = f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"
data = io.BytesIO(file_bytes)
await asyncio.to_thread(
self._client.put_object,
self._bucket,
object_key,
data,
length=len(file_bytes),
content_type=content_type,
)
return object_key
async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str:
return await asyncio.to_thread(
self._client.presigned_get_object,
bucket_name=self._bucket,
object_name=object_key,
expires=timedelta(minutes=expires_minutes),
)
async def health_check(self) -> bool:
try:
return await asyncio.to_thread(
self._client.bucket_exists, self._bucket
)
except Exception:
return False
```
**MinIO `put_object` signature (confirmed):**
```python
client.put_object(
bucket_name: str,
object_name: str, # the object key
data: io.RawIOBase, # io.BytesIO is accepted
length: int, # -1 with part_size for unknown-length streams
content_type: str = "application/octet-stream",
)
```
**Note on `length=-1`:** For unknown-length streams, set `length=-1` and `part_size=10*1024*1024`. For in-memory `io.BytesIO`, always pass `length=len(bytes)` — this avoids a multipart upload when not needed.
**Source:** [CITED: github.com/minio/minio-py/blob/master/docs/API.md]
---
### Pattern 4: MinIO Bucket Initialization at Startup
**What:** On first `docker compose up`, MinIO starts with an empty state. The application must create the `docuvault` bucket if it doesn't exist. This is done in the FastAPI lifespan, not in user request handlers.
**Example:**
```python
# main.py lifespan extension
@asynccontextmanager
async def lifespan(app: FastAPI):
# PostgreSQL engine + pool
# MinIO bucket initialization
minio_client = Minio(
settings.minio_endpoint,
access_key=settings.minio_access_key,
secret_key=settings.minio_secret_key,
secure=False,
)
exists = await asyncio.to_thread(minio_client.bucket_exists, settings.minio_bucket)
if not exists:
await asyncio.to_thread(minio_client.make_bucket, settings.minio_bucket)
app.state.minio = minio_client
yield
await engine.dispose()
```
---
### Pattern 5: Celery App + Redis Broker Configuration
**What:** A single `celery_app.py` module defines the Celery application. Tasks are defined as decorated functions. FastAPI route handlers call `.delay()` to enqueue; the celery-worker container processes them.
**Redis URL format (with password, Docker internal network):**
```
redis://:${REDIS_PASSWORD}@redis:6379/0
```
The `:` before the password with no username is the correct format when Redis is configured with `requirepass` but no ACL users. [CITED: docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html via WebSearch]
**Example:**
```python
# celery_app.py
import os
from celery import Celery
celery_app = Celery("docuvault")
celery_app.conf.broker_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")
celery_app.conf.result_backend = os.environ.get("REDIS_URL", "redis://redis:6379/0")
celery_app.conf.task_serializer = "json"
celery_app.conf.result_serializer = "json"
celery_app.conf.accept_content = ["json"]
celery_app.conf.task_routes = {
"tasks.document_tasks.*": {"queue": "documents"},
}
# tasks/document_tasks.py
from celery_app import celery_app
@celery_app.task(name="tasks.document_tasks.extract_and_classify")
def extract_and_classify(document_id: str) -> dict:
# Celery tasks are SYNCHRONOUS functions — do NOT use async def here.
# Use asyncio.run() sparingly or run sync equivalents of extractor/classifier.
from services import extractor, classifier
...
# api/documents.py — calling the task
from tasks.document_tasks import extract_and_classify
@router.post("/upload")
async def upload_document(...):
...
# Replace: background_tasks.add_task(classifier.classify_document, doc_id)
# With:
extract_and_classify.delay(str(saved_doc.id))
return meta
```
**Critical: Celery tasks are synchronous.** The Celery worker runs a standard Python event loop (not asyncio). Calling `async def` functions inside a Celery task requires `asyncio.run()`, which creates a new event loop per task invocation. This is acceptable for Phase 1 since the existing `extractor.py` and `classifier.py` services already have sync and async entry points, but keep tasks pure-sync where possible. [VERIFIED via WebSearch cross-checked with official docs]
**Worker startup command:**
```
celery -A celery_app worker --loglevel=info -Q documents
```
---
### Pattern 6: Docker Compose Health Checks + `depends_on`
**What:** Each infrastructure service has a `healthcheck` definition. The `backend` service uses `depends_on: condition: service_healthy` to wait for all three (postgres, minio, redis) before starting.
**Example:**
```yaml
services:
postgres:
image: postgres:17-alpine
environment:
POSTGRES_DB: docuvault
POSTGRES_USER: postgres
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
volumes:
- postgres_data:/var/lib/postgresql/data
- ./docker/postgres/initdb.d:/docker-entrypoint-initdb.d:ro
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres -d docuvault"]
interval: 10s
timeout: 5s
retries: 5
start_period: 10s
minio:
image: minio/minio:latest
command: server /data --console-address ":9001"
environment:
MINIO_ROOT_USER: ${MINIO_ROOT_USER}
MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
ports:
- "9000:9000"
- "9001:9001"
volumes:
- minio_data:/data
healthcheck:
# curl is removed from recent MinIO images; use the /minio/health/live HTTP endpoint
# from the host. Inside the container, mc is available:
test: ["CMD", "mc", "ready", "local"]
interval: 10s
timeout: 5s
retries: 5
start_period: 15s
redis:
image: redis:7-alpine
command: redis-server --requirepass ${REDIS_PASSWORD}
healthcheck:
test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
interval: 10s
timeout: 3s
retries: 5
backend:
depends_on:
postgres:
condition: service_healthy
minio:
condition: service_healthy
redis:
condition: service_healthy
```
**MinIO healthcheck note:** `curl` was removed from MinIO's Docker image in October 2023. The `mc ready local` command is the current recommended healthcheck inside the container. The `/minio/health/live` HTTP endpoint (returns 200 OK) is still valid for external probing but cannot be used inside the container without curl. [CITED: github.com/minio/minio/issues/18389]
---
### Pattern 7: PostgreSQL Two-User Init Script
**What:** The official PostgreSQL Docker image runs scripts in `/docker-entrypoint-initdb.d/` on first start (empty volume). A SQL script provisions two users: `docuvault_migrate` (DDL) and `docuvault_app` (runtime, restricted).
**When to use:** First `docker compose up` with a fresh volume. Idempotent for re-runs is not required — init scripts only run once.
**Example:**
```sql
-- docker/postgres/initdb.d/01-init-users.sql
-- Runs as the POSTGRES_USER superuser on first container start only.
-- Migration user: DDL privileges (CREATE TABLE, ALTER TABLE, CREATE INDEX)
CREATE USER docuvault_migrate WITH PASSWORD 'PLACEHOLDER_MIGRATE_PASSWORD';
GRANT ALL PRIVILEGES ON DATABASE docuvault TO docuvault_migrate;
-- App user: runtime DML only (SELECT, INSERT, UPDATE, DELETE)
CREATE USER docuvault_app WITH PASSWORD 'PLACEHOLDER_APP_PASSWORD';
GRANT CONNECT ON DATABASE docuvault TO docuvault_app;
-- Grant schema-level privileges AFTER migration user creates the schema
-- This must run after alembic upgrade head, OR grant in a second script.
-- Pattern: grant via a post-migration step or grant within the migration itself:
-- GRANT USAGE ON SCHEMA public TO docuvault_app;
-- GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO docuvault_app;
-- ALTER DEFAULT PRIVILEGES IN SCHEMA public
-- GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO docuvault_app;
```
**Important:** The `GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES` must be run AFTER Alembic has created the tables, because `ON ALL TABLES` applies only to existing tables. Use `ALTER DEFAULT PRIVILEGES` so future tables (from future migrations) are also accessible. This can be done at the end of the first Alembic migration file, or in a post-migration Docker entrypoint hook.
**Recommended approach for Phase 1:** Run the GRANT as the last step of the `0001_initial_schema.py` migration using `op.execute()` as the `docuvault_migrate` user (which has full privileges). [ASSUMED — no official doc confirming this is the standard Alembic pattern, but it follows from standard PostgreSQL privilege management]
---
### Pattern 8: StorageBackend ABC (Mirrors `ai/` Pattern)
**What:** `storage/base.py` defines `StorageBackend` as an abstract base class with the same structure as `ai/base.py`. `storage/__init__.py` provides a `get_storage_backend()` factory. `storage/minio_backend.py` is the Phase 1 implementation.
**Example:**
```python
# storage/base.py
from abc import ABC, abstractmethod
class StorageBackend(ABC):
@abstractmethod
async def put_object(
self, user_id: str, document_id: str,
file_bytes: bytes, extension: str, content_type: str,
) -> str:
"""Store object; return the object_key used."""
@abstractmethod
async def get_object(self, object_key: str) -> bytes:
"""Retrieve object bytes by key."""
@abstractmethod
async def delete_object(self, object_key: str) -> None:
"""Delete object by key."""
@abstractmethod
async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str:
"""Return a time-limited download URL."""
@abstractmethod
async def health_check(self) -> bool:
"""Return True if backend is reachable."""
# storage/__init__.py
from config import settings
from storage.minio_backend import MinIOBackend
def get_storage_backend() -> StorageBackend:
return MinIOBackend(
endpoint=settings.minio_endpoint,
access_key=settings.minio_access_key,
secret_key=settings.minio_secret_key,
bucket=settings.minio_bucket,
secure=False,
)
```
---
### Anti-Patterns to Avoid
- **Sync SQLAlchemy in async context:** Using `create_engine()` instead of `create_async_engine()` in FastAPI will block the event loop on every database call. Use `create_async_engine` throughout.
- **Calling `await session.commit()` then accessing lazy-loaded attributes:** Always set `expire_on_commit=False` or explicitly refresh after commit.
- **Connecting Alembic using `DATABASE_URL` (restricted user):** The restricted `docuvault_app` user has no DDL privileges. Alembic migrations will fail with `permission denied` errors. Alembic must always use `DATABASE_MIGRATE_URL`.
- **Using `async def` for Celery task functions:** Celery workers do not run an asyncio event loop. Define tasks as `def`, not `async def`. Wrap any async calls with `asyncio.run()` if unavoidable, but prefer sync implementations in tasks.
- **Storing human-readable filename as MinIO object key:** Object keys must be UUID-based (`{user_id}/{document_id}/{uuid4()}{ext}`). Filenames are stored ONLY in the `documents.filename` DB column. Putting human filenames in the key enables path traversal and makes key prediction trivial.
- **Using `minio_client.bucket_exists()` inside async handlers without `asyncio.to_thread`:** The MinIO SDK is synchronous; calling it directly from `async def` will block the event loop.
- **MinIO `mc ready local` healthcheck with a password-protected Redis `redis-cli ping`:** For Redis with `requirepass`, the healthcheck must pass `-a $REDIS_PASSWORD` to `redis-cli`. A bare `redis-cli ping` will return `NOAUTH` and be treated as unhealthy.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Async PostgreSQL session management | Custom connection/context manager | SQLAlchemy `async_sessionmaker` + `Depends(get_db)` | Handles connection pooling, transaction boundaries, error cleanup, and the `expire_on_commit` edge case |
| Database schema migrations | Manual `CREATE TABLE` scripts in Python | Alembic | Manages migration history, rollbacks, auto-generation from ORM models, and multi-environment DSN configuration |
| MinIO object lifecycle | Custom S3-like HTTP client | `minio` Python SDK | Handles multipart uploads, signature v4, presigned URL expiry, retry logic, and connection pooling |
| Background task distribution | Thread pools or `asyncio.create_task()` | Celery + Redis | Cross-instance task distribution, retry on failure, dead letter queues, task result storage |
| Docker service ordering | `sleep` commands in Compose entrypoints | `healthcheck` + `depends_on: condition: service_healthy` | Deterministic, declarative; `sleep` is a race condition |
| PostgreSQL privilege management | Per-table GRANT scripts written by hand | `ALTER DEFAULT PRIVILEGES` in Alembic migration | Future migrations automatically inherit privileges; hand-written grants go stale |
**Key insight:** The existing `filelock`-based `services/storage.py` uses at least 6 custom concurrency primitives to solve problems that PostgreSQL's transaction isolation and MinIO's atomic object operations solve at the infrastructure level. The rewrite simplifies the code while gaining correctness guarantees.
---
## Common Pitfalls
### Pitfall 1: `expire_on_commit=True` (the default) Causes `MissingGreenlet`
**What goes wrong:** After `await session.commit()`, accessing any ORM object attribute triggers a new SELECT query. In async context, if there is no active session scope, SQLAlchemy raises `sqlalchemy.exc.MissingGreenlet: greenlet_spawn has not been called`.
**Why it happens:** The default `Session.expire_on_commit=True` marks objects as "expired" post-commit. The next attribute access triggers a lazy load, which needs a sync greenlet context (not available in asyncio).
**How to avoid:** Always set `expire_on_commit=False` in `async_sessionmaker`. [CITED: docs.sqlalchemy.org]
**Warning signs:** `MissingGreenlet` in tracebacks after commit; attribute access on model instances outside `async with session` blocks.
---
### Pitfall 2: Alembic `env.py` Not Importing All Models
**What goes wrong:** `alembic revision --autogenerate` generates an empty migration even though models were defined.
**Why it happens:** Alembic's `target_metadata` must be set to `Base.metadata`, and all model modules must be imported BEFORE `target_metadata` is accessed in `env.py`. Python only knows about models that have been imported.
**How to avoid:** In `migrations/env.py`, explicitly import all model modules:
```python
from db import models # noqa: F401 — must import to register with Base.metadata
target_metadata = models.Base.metadata
```
**Warning signs:** Empty `op.` blocks in generated migrations; tables not appearing in migration history.
---
### Pitfall 3: MinIO `put_object` Requires `io.BytesIO.seek(0)` Before Use
**What goes wrong:** `put_object` reads 0 bytes if the `io.BytesIO` object's file pointer is at the end (e.g., after writing to it).
**Why it happens:** `io.BytesIO.write()` advances the pointer to the end of the data. `put_object` starts reading from the current position.
**How to avoid:** Always call `data.seek(0)` before passing a `BytesIO` to `put_object`. Or construct the `BytesIO` from the complete bytes directly: `io.BytesIO(file_bytes)` starts the pointer at 0.
**Warning signs:** MinIO reports successful upload but object is 0 bytes; or `OSError: stream having not enough data`.
---
### Pitfall 4: PostgreSQL Init Script GRANT Timing
**What goes wrong:** `docuvault_app` user gets `permission denied` on tables even after `GRANT ... ON ALL TABLES`.
**Why it happens:** `GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public` only applies to tables that exist at the time of the GRANT. Tables created by Alembic after the init script runs are not covered.
**How to avoid:** Run `ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO docuvault_app;` in the Alembic initial migration (as `docuvault_migrate` user, which owns the tables). This covers all future tables created by the same migration user.
**Warning signs:** First `docker compose up` works; second run after `alembic upgrade head` fails with 403 DB errors.
---
### Pitfall 5: Redis Healthcheck Without Authentication
**What goes wrong:** `redis-cli ping` returns `NOAUTH Authentication required` when Redis is started with `requirepass`. Docker Compose treats non-zero exit as unhealthy. Backend never starts.
**Why it happens:** `redis-cli ping` without `-a` doesn't pass the password.
**How to avoid:** Use `redis-cli -a ${REDIS_PASSWORD} ping` in the healthcheck `test` field. Note that this logs a warning about passing password on command line — acceptable for a healthcheck, not for production scripts.
**Warning signs:** `backend` service stuck at `Waiting for redis to be healthy`; `redis-cli ping` showing `NOAUTH` in container logs.
---
### Pitfall 6: MinIO `mc ready local` Healthcheck Not Available Without `mc`
**What goes wrong:** `mc` is present in the official `minio/minio` Docker image, so `mc ready local` works as a healthcheck. If using a third-party or stripped MinIO image, `mc` may be absent.
**How to avoid:** Stick to the official `minio/minio:latest` image. If a custom image is needed, use the `/minio/health/live` HTTP endpoint probed from a sidecar or from the host — not from inside the container without curl.
---
### Pitfall 7: Celery Worker Cannot Import FastAPI App Module
**What goes wrong:** Celery worker Docker container imports `celery_app.py`, which transitively imports the FastAPI app or lifespan, which tries to open database connections or access `app.state`.
**Why it happens:** Shared imports between the FastAPI app and Celery tasks create circular dependencies at module load time.
**How to avoid:** Keep `celery_app.py` minimal (Celery configuration only). Task functions in `tasks/` import services directly, not via `main.py` or any router. The Celery worker starts with `celery -A celery_app worker` — it never starts FastAPI.
---
## Code Examples
### Full v1 SQLAlchemy ORM Schema (Phase 1 Migration Target)
```python
# db/models.py
import uuid
from datetime import datetime, timezone
from sqlalchemy import (
Boolean, BigInteger, ForeignKey, Index, String, Text,
TIMESTAMP, UniqueConstraint, Integer
)
from sqlalchemy.dialects.postgresql import UUID, INET, JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
from sqlalchemy.sql import func
def now_utc():
return datetime.now(timezone.utc)
class Base(DeclarativeBase):
pass
class User(Base):
__tablename__ = "users"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
handle: Mapped[str] = mapped_column(String, unique=True, nullable=False)
email: Mapped[str] = mapped_column(String, unique=True, nullable=False)
password_hash: Mapped[str] = mapped_column(Text, nullable=False)
totp_secret: Mapped[str | None] = mapped_column(Text, nullable=True)
totp_enabled: Mapped[bool] = mapped_column(Boolean, nullable=False, default=False)
role: Mapped[str] = mapped_column(String, nullable=False, default="user")
is_active: Mapped[bool] = mapped_column(Boolean, nullable=False, default=True)
ai_provider: Mapped[str | None] = mapped_column(Text, nullable=True)
ai_model: Mapped[str | None] = mapped_column(Text, nullable=True)
default_storage_backend: Mapped[str] = mapped_column(String, nullable=False, default="minio")
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
class Quota(Base):
__tablename__ = "quotas"
user_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), primary_key=True)
limit_bytes: Mapped[int] = mapped_column(BigInteger, nullable=False, default=104857600) # 100 MB
used_bytes: Mapped[int] = mapped_column(BigInteger, nullable=False, default=0)
class RefreshToken(Base):
__tablename__ = "refresh_tokens"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
user_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False)
token_hash: Mapped[str] = mapped_column(Text, unique=True, nullable=False)
expires_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False)
revoked: Mapped[bool] = mapped_column(Boolean, nullable=False, default=False)
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
__table_args__ = (Index("ix_refresh_tokens_user_revoked", "user_id", "revoked"),)
class Folder(Base):
__tablename__ = "folders"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
user_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False)
parent_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), ForeignKey("folders.id", ondelete="CASCADE"), nullable=True)
name: Mapped[str] = mapped_column(Text, nullable=False)
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
__table_args__ = (UniqueConstraint("user_id", "parent_id", "name"),)
class Document(Base):
__tablename__ = "documents"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
# user_id is NULLABLE in Phase 1 (D-03); Phase 2 migration adds NOT NULL
user_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=True)
folder_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), ForeignKey("folders.id", ondelete="SET NULL"), nullable=True)
filename: Mapped[str] = mapped_column(Text, nullable=False) # original human-readable name
object_key: Mapped[str] = mapped_column(Text, nullable=False) # MinIO key: {user_id}/{doc_id}/{uuid4}{ext}
content_type: Mapped[str] = mapped_column(Text, nullable=False)
size_bytes: Mapped[int] = mapped_column(BigInteger, nullable=False, default=0)
storage_backend: Mapped[str] = mapped_column(String, nullable=False, default="minio")
extracted_text: Mapped[str | None] = mapped_column(Text, nullable=True)
status: Mapped[str] = mapped_column(String, nullable=False, default="pending")
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
updated_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
__table_args__ = (
Index("ix_documents_user_folder", "user_id", "folder_id"),
Index("ix_documents_user_created", "user_id", "created_at"),
)
class Topic(Base):
__tablename__ = "topics"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
user_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=True)
name: Mapped[str] = mapped_column(Text, nullable=False)
description: Mapped[str] = mapped_column(Text, nullable=False, default="")
color: Mapped[str] = mapped_column(String(7), nullable=False, default="#6366f1")
__table_args__ = (UniqueConstraint("user_id", "name"),)
class DocumentTopic(Base):
__tablename__ = "document_topics"
document_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("documents.id", ondelete="CASCADE"), primary_key=True)
topic_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("topics.id", ondelete="CASCADE"), primary_key=True)
class Share(Base):
__tablename__ = "shares"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
document_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("documents.id", ondelete="CASCADE"), nullable=False)
owner_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False)
recipient_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False)
permission: Mapped[str] = mapped_column(String, nullable=False, default="view")
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
__table_args__ = (
UniqueConstraint("document_id", "recipient_id"),
Index("ix_shares_recipient", "recipient_id"),
)
class AuditLog(Base):
__tablename__ = "audit_log"
id: Mapped[int] = mapped_column(Integer, primary_key=True, autoincrement=True)
user_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="SET NULL"), nullable=True)
actor_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="SET NULL"), nullable=True)
event_type: Mapped[str] = mapped_column(Text, nullable=False)
resource_id: Mapped[uuid.UUID | None] = mapped_column(UUID(as_uuid=True), nullable=True)
ip_address: Mapped[str | None] = mapped_column(INET, nullable=True)
metadata: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
__table_args__ = (
Index("ix_audit_user_created", "user_id", "created_at"),
Index("ix_audit_event_created", "event_type", "created_at"),
)
class CloudConnection(Base):
__tablename__ = "cloud_connections"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
user_id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=False)
provider: Mapped[str] = mapped_column(String, nullable=False)
display_name: Mapped[str] = mapped_column(Text, nullable=False)
credentials_enc: Mapped[str] = mapped_column(Text, nullable=False)
status: Mapped[str] = mapped_column(String, nullable=False, default="ACTIVE")
connected_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
__table_args__ = (Index("ix_cloud_connections_user", "user_id"),)
class Group(Base):
"""v2 stub — empty table, seeded for schema completeness (PROJECT.md)."""
__tablename__ = "groups"
id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
name: Mapped[str] = mapped_column(Text, unique=True, nullable=False)
created_at: Mapped[datetime] = mapped_column(TIMESTAMP(timezone=True), nullable=False, server_default=func.now())
```
---
### Config Extension for New Env Vars
```python
# config.py (extended)
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
# Existing
data_dir: str = "/app/data"
# Phase 1 additions
database_url: str = "postgresql+psycopg://docuvault_app:changeme@postgres/docuvault"
database_migrate_url: str = "postgresql+psycopg://docuvault_migrate:changeme@postgres/docuvault"
minio_endpoint: str = "minio:9000"
minio_access_key: str = "docuvault_app"
minio_secret_key: str = "changeme"
minio_bucket: str = "docuvault"
redis_url: str = "redis://:changeme@redis:6379/0"
secret_key: str = "CHANGEME" # documented for Phase 2; not read in Phase 1
class Config:
env_file = ".env"
env_file_encoding = "utf-8"
settings = Settings()
```
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `asyncpg` as the only async PostgreSQL dialect | `psycopg` v3 supports both sync + async via one package | 2022 (psycopg v3 release) | Single driver for Alembic + FastAPI; no separate sync/async packages |
| `alembic init` (sync template) | `alembic init -t async` for async engine migrations | Alembic 1.7+ | env.py template pre-configured for asyncio; no manual async wiring |
| `async_sessionmaker` equivalent was `sessionmaker` with separate import | `async_sessionmaker` is a first-class API in SQLAlchemy 2.0 | SQLAlchemy 2.0 (2023) | Cleaner factory pattern without subclassing |
| MinIO Docker image included `curl` for healthchecks | `curl` removed from image; `mc ready local` is the new healthcheck | October 2023 | Existing tutorials with `curl -f` healthcheck will silently fail on current images |
| `FastAPI BackgroundTasks` for async post-request work | Celery + Redis for distributed, reliable task queues | Ongoing | `BackgroundTasks` is per-instance and has no retry; Celery is cross-instance |
**Deprecated/outdated:**
- `filelock` dependency: can be removed from `backend/requirements.txt` once `services/storage.py` is replaced (CONCERNS.md item 14 identifies the unused `shutil` import; same cleanup applies to `filelock`).
- Per-document `.lock` files in `data/metadata/`: deleted with `data/` directory contents (D-04).
- `psycopg2` (old driver): not installed and not needed; `psycopg` v3 is the replacement.
- Sync file I/O in async handlers (CONCERNS.md item 6): resolved entirely by switching to async SQLAlchemy.
---
## Assumptions Log
| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | Running `GRANT ... ON ALL TABLES` inside the Alembic initial migration as `docuvault_migrate` is the standard pattern for privilege handoff to `docuvault_app` | Pattern 7 (PostgreSQL init script) | If the migration user lacks permission to GRANT to another user, privileges must be set manually or via a separate script — delays testing |
| A2 | The Celery worker container can import `db/models.py` and `services/` directly without starting FastAPI (no circular import) | Pattern 5 (Celery) | If service modules import FastAPI components at module level, a refactor is needed before worker tasks can import services |
| A3 | `minio/minio:latest` Docker image includes `mc` for the `mc ready local` healthcheck | Pattern 6 (Docker Compose) | If `mc` is not in the image, healthcheck must use a shell-based TCP probe or alternative; confirmed via GitHub issue discussion [CITED: github.com/minio/minio/issues/18389] but version-specific |
---
## Open Questions
1. **PostgreSQL version to pin in Docker Compose**
- What we know: Any PostgreSQL 14+ supports `gen_random_uuid()`, `JSONB`, `INET`, and `TIMESTAMPTZ` used in the schema.
- What's unclear: Whether to use `postgres:16`, `postgres:17`, or `postgres:17-alpine`.
- Recommendation: Use `postgres:17-alpine` (smallest image, current stable, alpine is well-suited for Docker Compose dev setups).
2. **MinIO version pinning**
- What we know: `minio/minio:latest` has `mc` available for healthchecks; `curl` was removed in late 2023.
- What's unclear: Whether to pin to a specific release tag (e.g., `RELEASE.2025-09-07T16-13-09Z`) or use `:latest`.
- Recommendation: Pin to a specific RELEASE tag for reproducibility; update as part of a maintenance task. [ASSUMED — no strong official guidance on whether `:latest` is appropriate for production-adjacent Docker Compose]
3. **Topics table migration: existing topic names from `data/topics.json`**
- What we know: D-04 deletes `data/` contents. Topics stored in `topics.json` are test data and are deleted.
- What's unclear: The existing `api/topics.py` and `frontend/src/stores/topics.js` need updating to read from PostgreSQL instead of the flat file. The API shape should remain the same (list of objects with `id`, `name`, `description`, `color`).
- Recommendation: The planner must include a task for updating `api/topics.py` to use async SQLAlchemy ORM queries against the `topics` table.
4. **Celery task vs direct service call for text extraction + classification**
- What we know: The current `api/documents.py` calls `await classifier.classify_document()` inside the route handler. This needs to move to a Celery task.
- What's unclear: Whether Phase 1 should move ALL of extraction + classification into a Celery task (full async flow) or just wire up the infrastructure with a placeholder task and migrate the logic in Phase 3.
- Recommendation: Phase 1 should wire the full task (extract + classify) in Celery — the walking skeleton requirement says "AI classification workflow completes successfully." A placeholder task that doesn't classify would fail the success criteria.
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Docker | Docker Compose services | ✓ | 29.5.0 | — |
| Python 3.12 | Backend (in Docker image) | ✓ (host: 3.14.5; Docker: 3.12 pinned) | 3.12 in image | — |
| PostgreSQL (via Docker) | Database tier | ✓ (via Docker) | 17 (image) | — |
| MinIO (via Docker) | Object storage | ✓ (via Docker) | latest | — |
| Redis (via Docker) | Celery broker, Phase 2 rate limiting | ✓ (via Docker) | 7-alpine | — |
| pytest | Backend test runner | ✓ (host pip3) | existing | — |
**Missing dependencies with no fallback:** None.
**Missing dependencies with fallback:** None.
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest with pytest-asyncio (existing) |
| Config file | `backend/pytest.ini` (existing; `asyncio_mode = auto`) |
| Quick run command | `cd backend && pytest tests/test_health.py tests/test_documents.py tests/test_storage.py -x` |
| Full suite command | `cd backend && pytest -v` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| STORE-01 | Upload stores metadata in PostgreSQL and bytes in MinIO | integration | `pytest tests/test_documents.py::test_upload_stores_to_postgres_and_minio -x` | ❌ Wave 0 |
| STORE-01 | List documents reads from PostgreSQL (not filesystem) | integration | `pytest tests/test_documents.py::test_list_reads_from_db -x` | ❌ Wave 0 |
| STORE-02 | MinIO object key matches `{user_id}/{document_id}/{uuid4}{ext}` pattern | unit | `pytest tests/test_storage.py::test_object_key_schema -x` | ❌ Wave 0 |
| STORE-02 | Human-readable filename is NOT in the object key | unit | `pytest tests/test_storage.py::test_filename_not_in_object_key -x` | ❌ Wave 0 |
| STORE-07 | `/health` returns PostgreSQL + MinIO connectivity (not just `{"status": "ok"}`) | smoke | `pytest tests/test_health.py::test_health_checks_postgres_and_minio -x` | ❌ Wave 0 |
| STORE-07 (implicit) | Storage service has no file locks; concurrent uploads do not corrupt state | integration | `pytest tests/test_documents.py::test_concurrent_uploads -x` | ❌ Wave 0 |
### Sampling Rate
- **Per task commit:** `cd backend && pytest tests/test_health.py tests/test_storage.py -x`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `tests/test_storage.py` — covers STORE-02 (object key schema, filename isolation)
- [ ] `tests/test_documents.py` — extend for PostgreSQL/MinIO-backed upload/list (STORE-01)
- [ ] `tests/test_health.py` — extend for PostgreSQL + MinIO connectivity probes (STORE-07)
- [ ] `tests/conftest.py` — add async engine + session fixtures; add MinIO mock or test bucket fixture
- [ ] Update `tests/conftest.py` to monkeypatch `db/session.py` paths (not just `config.py` paths)
**Existing tests:** `test_documents.py`, `test_topics.py`, `test_settings.py` test the OLD flat-file storage layer. They will break after `services/storage.py` is replaced. These must be ported (not deleted) as part of Phase 1.
---
## Security Domain
### Applicable ASVS Categories
| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | No — Phase 1 has no auth | Phase 2 |
| V3 Session Management | No — Phase 1 has no sessions | Phase 2 |
| V4 Access Control | Partial — object key isolation in MinIO backend | `user_id` prefix enforced in `MinIOBackend.put_object()` |
| V5 Input Validation | Yes — file upload content type + size | Existing `ALLOWED_MIME_TYPES` enforcement (currently unenforced per CONCERNS.md item 1) |
| V6 Cryptography | No — Phase 1 has no credential encryption | Phase 5 |
### Known Threat Patterns for This Phase
| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Object key prediction / path traversal | Tampering | UUID-based object keys (`{user_id}/{document_id}/{uuid4}{ext}`); never accept object keys from request parameters |
| Database superuser credentials in app DSN | Elevation of Privilege | Two-DSN pattern: `docuvault_app` (restricted) for runtime, `docuvault_migrate` (DDL) for Alembic only |
| MinIO credentials with bucket admin rights | Elevation of Privilege | App-level access key pair (`MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY`) with read/write on `docuvault` bucket only; root credentials not used by app |
| Redis unauthenticated in Docker network | Information Disclosure | `requirepass` set on Redis; `REDIS_URL` includes password; Celery broker and app use authenticated URL |
| SQL injection via ORM | Tampering | SQLAlchemy ORM / parameterized queries throughout; zero raw string interpolation (matches CLAUDE.md SEC-03) |
| Sensitive data in MinIO object key | Information Disclosure | Human-readable filenames stored in DB only; object key is UUID-based and non-predictable |
---
## Sources
### Primary (HIGH confidence)
- [docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html](https://docs.sqlalchemy.org/en/20/orm/extensions/asyncio.html) — async engine setup, `async_sessionmaker`, `expire_on_commit=False`, FastAPI lifespan integration
- [alembic.sqlalchemy.org/en/latest/cookbook.html#using-asyncio-with-alembic](https://alembic.sqlalchemy.org/en/latest/cookbook.html) — async `env.py` pattern
- [github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py](https://github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py) — official async env.py template code
- [github.com/minio/minio-py/blob/master/docs/API.md](https://github.com/minio/minio-py/blob/master/docs/API.md) — `put_object`, `presigned_get_object`, constructor signatures
- [github.com/minio/minio/issues/18389](https://github.com/minio/minio/issues/18389) — `curl` removal from MinIO image; `mc ready local` as replacement
- [docs.min.io/enterprise/aistor-object-store/operations/monitoring/healthcheck-probe/](https://docs.min.io/enterprise/aistor-object-store/operations/monitoring/healthcheck-probe/) — `/minio/health/live` endpoint documented
- [docs.docker.com/reference/compose-file/services/#healthcheck](https://docs.docker.com/reference/compose-file/services/#healthcheck) — `healthcheck` + `depends_on: condition: service_healthy` syntax
### Secondary (MEDIUM confidence)
- [docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html](https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html) — Redis URL format verified via WebSearch; Celery docs site was unreachable during research session
- [testdriven.io/blog/fastapi-and-celery/](https://testdriven.io/blog/fastapi-and-celery/) — Celery + FastAPI project structure and `.delay()` pattern
- WebSearch results cross-referenced with official docs for psycopg install extras, Redis broker URL format, PostgreSQL init script pattern
### Tertiary (LOW confidence)
- None — all key claims cross-verified with at least one authoritative source
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all packages verified on PyPI via `pip3 index versions`, slopcheck [OK] for all 6 core packages
- Architecture: HIGH — patterns drawn from SQLAlchemy official docs, Alembic official template, and MinIO official GitHub
- Pitfalls: HIGH — each pitfall sourced from official documentation or confirmed GitHub issues (not community blog posts only)
- Celery configuration: MEDIUM — Celery docs site was unreachable; URL format cross-verified via WebSearch + community sources
**Research date:** 2026-05-21
**Valid until:** 2026-06-21 for stable stack; MinIO healthcheck pattern should be re-verified if the Docker image version changes significantly