# Phase 1: Infrastructure Foundation - Pattern Map

**Mapped:** 2026-05-21
**Files analyzed:** 14 new/modified files
**Analogs found:** 12 / 14

---

## File Classification

| New/Modified File | Role | Data Flow | Closest Analog | Match Quality |
|-------------------|------|-----------|----------------|---------------|
| `docker-compose.yml` | config | request-response | `docker-compose.yml` (current) | exact — extend in-place |
| `docker/postgres/initdb.d/01-init-users.sql` | config | batch | none in codebase | no analog |
| `backend/db/session.py` | config | CRUD | `backend/config.py` (module-level setup pattern) | partial |
| `backend/db/models.py` | model | CRUD | none in codebase | no analog (schema from RESEARCH.md) |
| `backend/deps/db.py` | utility | CRUD | `backend/config.py` (module-level constants pattern) | partial |
| `backend/config.py` | config | request-response | `backend/config.py` (current) | exact — extend in-place |
| `backend/main.py` | config | request-response | `backend/main.py` (current) | exact — extend in-place |
| `backend/storage/base.py` | utility | request-response | `backend/ai/base.py` | exact role-match |
| `backend/storage/__init__.py` | utility | request-response | `backend/ai/__init__.py` | exact role-match |
| `backend/storage/minio_backend.py` | service | file-I/O | `backend/ai/openai_provider.py` | role-match (ABC impl) |
| `backend/services/storage.py` | service | CRUD | `backend/services/storage.py` (current) | exact — replace in-place |
| `backend/celery_app.py` | config | event-driven | none in codebase | no analog |
| `backend/tasks/document_tasks.py` | service | event-driven | `backend/services/classifier.py` | role-match (orchestration) |
| `backend/api/documents.py` | controller | request-response | `backend/api/documents.py` (current) | exact — update in-place |
| `backend/api/topics.py` | controller | request-response | `backend/api/topics.py` (current) | exact — update in-place |
| `backend/requirements.txt` | config | — | `backend/requirements.txt` (current) | exact — extend in-place |
| `.env.example` | config | — | `.env.example` (current) | exact — extend in-place |
| `backend/tests/conftest.py` | test | CRUD | `backend/tests/conftest.py` (current) | exact — update in-place |
| `backend/tests/test_health.py` | test | request-response | `backend/tests/test_health.py` (current) | exact — update in-place |
| `backend/tests/test_documents.py` | test | CRUD | `backend/tests/test_documents.py` (current) | exact — update in-place |
| `backend/tests/test_storage.py` | test | file-I/O | none in codebase | no analog (new) |
| `backend/alembic.ini` | config | — | none in codebase | no analog |
| `backend/migrations/env.py` | config | batch | none in codebase | no analog (pattern from RESEARCH.md) |
| `backend/migrations/versions/0001_initial_schema.py` | migration | batch | none in codebase | no analog (schema from RESEARCH.md) |

---

## Pattern Assignments

### `docker-compose.yml` (config, request-response)

**Analog:** `docker-compose.yml` (current, lines 1–26)

**Existing service block pattern** (lines 1–26 of current `docker-compose.yml`):
```yaml
services:
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    volumes:
      - ./backend/data:/app/data
      - ./backend:/app
    environment:
      - DATA_DIR=/app/data
      - PYTHONDONTWRITEBYTECODE=1
    extra_hosts:
      - "host.docker.internal:host-gateway"
    command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload

  frontend:
    build: ./frontend
    ports:
      - "5173:5173"
    volumes:
      - ./frontend/src:/app/src
      - ./frontend/index.html:/app/index.html
    depends_on:
      - backend
    command: npm run dev -- --host 0.0.0.0
```

**New services to add — copy structure from RESEARCH.md Pattern 6** (lines 512–567):
```yaml
  postgres:
    image: postgres:17-alpine
    environment:
      POSTGRES_DB: docuvault
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./docker/postgres/initdb.d:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d docuvault"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  celery-worker:
    build: ./backend
    command: celery -A celery_app worker --loglevel=info -Q documents
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - REDIS_URL=${REDIS_URL}
      - MINIO_ENDPOINT=${MINIO_ENDPOINT}
      - MINIO_ACCESS_KEY=${MINIO_ACCESS_KEY}
      - MINIO_SECRET_KEY=${MINIO_SECRET_KEY}
      - MINIO_BUCKET=${MINIO_BUCKET}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      minio:
        condition: service_healthy
```

**`backend` service update: add the new environment variables and `depends_on` conditions:**
```yaml
  backend:
    ...
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - DATABASE_MIGRATE_URL=${DATABASE_MIGRATE_URL}
      - MINIO_ENDPOINT=${MINIO_ENDPOINT}
      - MINIO_ACCESS_KEY=${MINIO_ACCESS_KEY}
      - MINIO_SECRET_KEY=${MINIO_SECRET_KEY}
      - MINIO_BUCKET=${MINIO_BUCKET}
      - REDIS_URL=${REDIS_URL}
      - PYTHONDONTWRITEBYTECODE=1
    depends_on:
      postgres:
        condition: service_healthy
      minio:
        condition: service_healthy
      redis:
        condition: service_healthy
```

**Remove** the `volumes:` entry for `./backend/data:/app/data`, because flat-file storage is deleted (D-04).

**Add named volumes block at end of file:**
```yaml
volumes:
  postgres_data:
  minio_data:
```

---

### `backend/config.py` (config, request-response)

**Analog:** `backend/config.py` (current, lines 1–52)

**Existing pattern** (lines 1–10 — module-level constants, NOT Pydantic Settings):
```python
import json
import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("DATA_DIR", "/app/data"))
UPLOADS_DIR = DATA_DIR / "uploads"
METADATA_DIR = DATA_DIR / "metadata"
TOPICS_FILE = DATA_DIR / "topics.json"
SETTINGS_FILE = DATA_DIR / "settings.json"
```

**Replace the path constants with Pydantic Settings** (per RESEARCH.md Code Examples, lines 914–937).
The existing `config.py` does not use `pydantic-settings`; Phase 1 introduces it. The pattern to follow is the RESEARCH.md example, not the current file. Keep the `DEFAULT_SYSTEM_PROMPT` and `DEFAULT_SETTINGS` constants for backward compatibility during the transition; remove `ensure_data_dirs()` and all path constants once `services/storage.py` is replaced.

**New pattern:**
```python
# backend/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings v2 style (replaces the deprecated inner `class Config`).
    # extra="ignore" lets .env carry keys this class does not declare
    # (e.g. POSTGRES_PASSWORD, MINIO_ROOT_USER) without a validation error.
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )

    # Legacy — keep during transition, remove after storage.py rewrite
    data_dir: str = "/app/data"

    # Phase 1 additions
    database_url: str = "postgresql+psycopg://docuvault_app:changeme@postgres/docuvault"
    database_migrate_url: str = "postgresql+psycopg://docuvault_migrate:changeme@postgres/docuvault"
    minio_endpoint: str = "minio:9000"
    minio_access_key: str = "docuvault_app"
    minio_secret_key: str = "changeme"
    minio_bucket: str = "docuvault"
    redis_url: str = "redis://:changeme@redis:6379/0"
    secret_key: str = "CHANGEME"  # documented for Phase 2; not read in Phase 1

settings = Settings()
```

Note: `pydantic-settings` is already in `requirements.txt` (line 4). No new dependency needed.

---

### `backend/main.py` (config, request-response)

**Analog:** `backend/main.py` (current, lines 1–34)

**Existing lifespan pattern** (lines 10–14):
```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    ensure_data_dirs()
    yield
```

**Extend lifespan**: replace the `ensure_data_dirs()` call with MinIO bucket initialization at startup, and dispose the database engine at shutdown. Copy the `asynccontextmanager` + `yield` structure exactly:
```python
from contextlib import asynccontextmanager
import asyncio
from fastapi import FastAPI
from minio import Minio
from db.session import engine
from config import settings

@asynccontextmanager
async def lifespan(app: FastAPI):
    # MinIO bucket initialization
    minio_client = Minio(
        settings.minio_endpoint,
        access_key=settings.minio_access_key,
        secret_key=settings.minio_secret_key,
        secure=False,
    )
    exists = await asyncio.to_thread(minio_client.bucket_exists, settings.minio_bucket)
    if not exists:
        await asyncio.to_thread(minio_client.make_bucket, settings.minio_bucket)
    app.state.minio = minio_client
    yield
    # Shutdown: close all pooled connections
    await engine.dispose()
```

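
The startup/shutdown ordering of the lifespan pattern can be checked without FastAPI, MinIO, or a database. The following is a minimal sketch using only the standard library; the `events` list and the `serve` coroutine are illustrative stand-ins, and the `try/finally` around `yield` is an optional variation (not part of the pattern above) that runs the shutdown step even if serving raises.

```python
import asyncio
from contextlib import asynccontextmanager

# Records the order in which the phases run (illustrative only).
events: list[str] = []


@asynccontextmanager
async def lifespan(app):
    events.append("startup")       # stands in for bucket initialization
    try:
        yield                      # the application serves requests here
    finally:
        events.append("shutdown")  # stands in for engine.dispose()


async def serve() -> None:
    # A framework would enter the context before serving and exit it afterwards.
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(serve())
```
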
**Extend `/health` endpoint**: keep the existing `@app.get("/health")` route, add a `request: Request` parameter to `async def health()`, and extend the body:
```python
# asyncio and settings are already imported for the lifespan above
from fastapi import Request
from sqlalchemy import text
from db.session import AsyncSessionLocal

@app.get("/health")
async def health(request: Request):
    checks = {}
    # PostgreSQL probe
    try:
        async with AsyncSessionLocal() as session:
            await session.execute(text("SELECT 1"))
        checks["postgres"] = "ok"
    except Exception as e:
        checks["postgres"] = f"error: {e}"

    # MinIO probe
    try:
        ok = await asyncio.to_thread(request.app.state.minio.bucket_exists, settings.minio_bucket)
        checks["minio"] = "ok" if ok else "bucket missing"
    except Exception as e:
        checks["minio"] = f"error: {e}"

    overall = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    return {"status": overall, "checks": checks}
```

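
The aggregation rule used by the handler (overall `ok` only when every probe reports `ok`, otherwise `degraded`) can be exercised without PostgreSQL or MinIO. This is a minimal sketch under stated assumptions: `run_probe`, `aggregate`, and the two probe coroutines are hypothetical helpers introduced for illustration and do not exist in the codebase.

```python
import asyncio


async def run_probe(probe) -> str:
    """Run one async probe; map success to "ok" and failure to an error string."""
    try:
        await probe()
        return "ok"
    except Exception as e:
        return f"error: {e}"


def aggregate(checks: dict[str, str]) -> dict:
    """Overall status is "ok" only if every individual check is "ok"."""
    overall = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    return {"status": overall, "checks": checks}


async def healthy_probe() -> None:
    # Stand-in for a probe that succeeds (e.g. SELECT 1).
    return None


async def failing_probe() -> None:
    # Stand-in for a probe whose backend is unreachable.
    raise ConnectionError("connection refused")


async def demo() -> dict:
    checks = {
        "postgres": await run_probe(healthy_probe),
        "minio": await run_probe(failing_probe),
    }
    return aggregate(checks)


result = asyncio.run(demo())
```
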

---

### `backend/db/session.py` (config, CRUD)

**Analog:** None exact. Closest structural analog is `backend/config.py` (module-level initialization pattern at lines 1–10).

**Pattern from RESEARCH.md Pattern 1** (lines 240–266):
```python
# backend/db/session.py
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from config import settings

engine = create_async_engine(
    settings.database_url,  # postgresql+psycopg://docuvault_app:...@postgres/docuvault
    pool_pre_ping=True,     # detect stale connections before use
    echo=False,
)

AsyncSessionLocal = async_sessionmaker(
    engine,
    class_=AsyncSession,
    expire_on_commit=False,  # prevent MissingGreenlet errors after commit
)
```

**Key rule:** `expire_on_commit=False` is mandatory — see RESEARCH.md Pitfall 1.

---

### `backend/deps/db.py` (utility, CRUD)

**Analog:** None exact. The dependency injection `yield` pattern mirrors how `backend/tests/conftest.py` yields fixtures (lines 13–43).

**Pattern from RESEARCH.md Pattern 1** (lines 258–266):
```python
# backend/deps/db.py
from db.session import AsyncSessionLocal

async def get_db():
    async with AsyncSessionLocal() as session:
        try:
            yield session
        finally:
            await session.close()
```

Use as a FastAPI dependency: `session: AsyncSession = Depends(get_db)`.

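
The cleanup guarantee of a `yield`-based dependency (the `finally` branch runs whether or not the handler raises) can be illustrated with the standard library alone. This is a sketch, not FastAPI's actual machinery: `FakeSession` and `handle_request` are hypothetical stand-ins, and `contextlib.asynccontextmanager` is used only to drive the generator in a comparable enter/exit fashion.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical stand-in for AsyncSession that records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def get_db():
    session = FakeSession()
    try:
        yield session
    finally:
        # Runs after the consumer finishes, even if it raised.
        await session.close()


async def handle_request(fail: bool) -> FakeSession:
    # Entering the context runs the code before `yield`;
    # leaving it (normally or via an exception) runs the `finally` block.
    seen = None
    try:
        async with asynccontextmanager(get_db)() as session:
            seen = session
            if fail:
                raise RuntimeError("handler failed")
    except RuntimeError:
        pass
    return seen


ok_session = asyncio.run(handle_request(fail=False))
failed_session = asyncio.run(handle_request(fail=True))
```
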

---

### `backend/db/models.py` (model, CRUD)

**Analog:** None in codebase. The full schema is specified in RESEARCH.md Code Examples (lines 769–908).

**Import block to copy:**
```python
import uuid
from datetime import datetime, timezone
from sqlalchemy import (
    Boolean, BigInteger, ForeignKey, Index, String, Text,
    TIMESTAMP, UniqueConstraint, Integer
)
from sqlalchemy.dialects.postgresql import UUID, INET, JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
from sqlalchemy.sql import func
```

**Base class pattern:**
```python
class Base(DeclarativeBase):
    pass
```

**Critical D-03:** `Document.user_id` must be `nullable=True` in Phase 1:
```python
user_id: Mapped[uuid.UUID | None] = mapped_column(
    UUID(as_uuid=True), ForeignKey("users.id", ondelete="CASCADE"), nullable=True
)
```

Use the full schema from RESEARCH.md lines 788–908 verbatim — it was designed to be implementation-ready.

---

### `backend/storage/base.py` (utility, request-response)

**Analog:** `backend/ai/base.py` (lines 1–33) — exact structural match.

**ABC pattern from `backend/ai/base.py`** (lines 1–33):
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

class AIProvider(ABC):
    @abstractmethod
    async def classify(self, ...) -> ClassificationResult: ...

    @abstractmethod
    async def health_check(self) -> bool: ...
```

**Apply same structure** for `StorageBackend`. The `health_check()` abstract method is already present in `ai/base.py` (line 31) — mirror it exactly in `StorageBackend`. Method signatures from RESEARCH.md Pattern 8 (lines 617–640):
```python
# backend/storage/base.py
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    async def put_object(
        self, user_id: str, document_id: str,
        file_bytes: bytes, extension: str, content_type: str,
    ) -> str:
        """Store object; return the object_key used."""

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes:
        """Retrieve object bytes by key."""

    @abstractmethod
    async def delete_object(self, object_key: str) -> None:
        """Delete object by key."""

    @abstractmethod
    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str:
        """Return a time-limited download URL."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Return True if backend is reachable."""
```

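
One practical benefit of the abstract interface is that tests can substitute a backend that never touches MinIO. The sketch below is an assumption-laden illustration: it redeclares a trimmed copy of the interface (three of the five methods) so that it runs standalone, and `InMemoryBackend` and `Incomplete` are hypothetical classes that are not part of the planned codebase. The object-key format follows the `{user_id}/{document_id}/{uuid}{extension}` scheme used by the MinIO backend pattern in this document.

```python
import asyncio
import uuid
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Trimmed copy of the interface (three of the five methods) for a standalone demo."""

    @abstractmethod
    async def put_object(self, user_id: str, document_id: str,
                         file_bytes: bytes, extension: str, content_type: str) -> str: ...

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical test double: keeps objects in a dict instead of object storage."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        object_key = f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"
        self._objects[object_key] = file_bytes
        return object_key

    async def get_object(self, object_key: str) -> bytes:
        return self._objects[object_key]

    async def health_check(self) -> bool:
        return True


class Incomplete(StorageBackend):
    """Implements nothing, so the ABC machinery refuses to instantiate it."""


async def demo():
    backend = InMemoryBackend()
    key = await backend.put_object("u1", "d1", b"hello", ".txt", "text/plain")
    return key, await backend.get_object(key), await backend.health_check()


key, payload, healthy = asyncio.run(demo())

try:
    Incomplete()
    instantiation_blocked = False
except TypeError:
    instantiation_blocked = True
```
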

---

### `backend/storage/__init__.py` (utility, request-response)

**Analog:** `backend/ai/__init__.py` (lines 1–36) — exact structural match.

**Factory pattern from `backend/ai/__init__.py`** (lines 1–10 and 8–36):
```python
from ai.base import AIProvider, ClassificationResult
from ai.anthropic_provider import AnthropicProvider
# ... more imports

def get_provider(settings: dict) -> AIProvider:
    active = settings.get("active_provider", "lmstudio")
    match active:
        case "anthropic":
            return AnthropicProvider(...)
        case _:
            raise ValueError(f"Unknown AI provider: {active}")
```

**Apply same factory pattern** for storage. Phase 1 has only one backend (MinIO), so the `match` can be omitted initially, but the factory function signature is mandatory:
```python
# backend/storage/__init__.py
from config import settings
from storage.minio_backend import MinIOBackend
from storage.base import StorageBackend

def get_storage_backend() -> StorageBackend:
    return MinIOBackend(
        endpoint=settings.minio_endpoint,
        access_key=settings.minio_access_key,
        secret_key=settings.minio_secret_key,
        bucket=settings.minio_bucket,
        secure=False,
    )
```


---

### `backend/storage/minio_backend.py` (service, file-I/O)

**Analog:** `backend/ai/openai_provider.py` (lines 1–104) — same ABC-implementation pattern.

**ABC implementation pattern from `backend/ai/openai_provider.py`** (lines 9–70):
```python
class OpenAIProvider(AIProvider):
    def __init__(self, api_key: str, model: str = "gpt-4o", base_url: str | None = None):
        self._api_key = api_key
        self._model = model
        self._base_url = base_url

    def _client(self) -> AsyncOpenAI:
        return AsyncOpenAI(api_key=self._api_key or "placeholder", base_url=self._base_url)

    async def health_check(self) -> bool:
        try:
            await self._client().chat.completions.create(...)
            return True
        except Exception:
            return False
```

Copy this structure: `__init__` stores the configuration, a private `_client` member provides the SDK client (a factory method in the analog, a stored instance in `MinIOBackend`), every public method is `async def`, and `health_check` wraps its call in `try/except` and returns a `bool`.

**Key difference from AI providers:** MinIO SDK is synchronous, so all calls must be wrapped in `asyncio.to_thread()`. Copy the wrapping pattern from RESEARCH.md Pattern 3 (lines 349–403):
```python
import asyncio
import io
import uuid

from minio import Minio
from storage.base import StorageBackend

class MinIOBackend(StorageBackend):
    def __init__(self, endpoint, access_key, secret_key, bucket, secure=False):
        self._client = Minio(endpoint=endpoint, access_key=access_key,
                             secret_key=secret_key, secure=secure)
        self._bucket = bucket

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        object_key = f"{user_id}/{document_id}/{uuid.uuid4()}{extension}"
        data = io.BytesIO(file_bytes)  # BytesIO() constructor sets pointer at 0; no seek(0) needed
        await asyncio.to_thread(
            self._client.put_object,
            self._bucket, object_key, data, length=len(file_bytes), content_type=content_type,
        )
        return object_key

    async def health_check(self) -> bool:
        try:
            return await asyncio.to_thread(self._client.bucket_exists, self._bucket)
        except Exception:
            return False

    # Excerpt: the remaining StorageBackend methods are not shown here.
```

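
The reason for the `asyncio.to_thread()` wrapping is that a blocking call made directly inside a coroutine stalls the event loop for every other request. A minimal sketch of the effect, assuming a hypothetical `SlowSyncClient` that stands in for a synchronous SDK (it is not the MinIO client and makes no network calls):

```python
import asyncio
import threading
import time


class SlowSyncClient:
    """Hypothetical blocking client standing in for a synchronous SDK."""

    def bucket_exists(self, bucket: str) -> bool:
        time.sleep(0.05)  # simulate blocking network I/O
        # Record which thread did the work so the demo can inspect it.
        self.worker_thread = threading.current_thread()
        return bucket == "docuvault"


async def demo():
    client = SlowSyncClient()
    # to_thread(func, *args, **kwargs) runs func in a worker thread and returns
    # an awaitable, so the event loop stays free for other coroutines meanwhile.
    exists, _ = await asyncio.gather(
        asyncio.to_thread(client.bucket_exists, "docuvault"),
        asyncio.sleep(0.01),  # another coroutine making progress concurrently
    )
    ran_off_loop_thread = client.worker_thread is not threading.current_thread()
    return exists, ran_off_loop_thread


exists, ran_off_loop_thread = asyncio.run(demo())
```
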

---

### `backend/services/storage.py` (service, CRUD)

**Analog:** `backend/services/storage.py` (current, lines 1–188) — replace entirely.

**Current pattern shows the data-access interface** that `api/documents.py` depends on (lines 18–95). The new implementation must preserve the same function signatures where possible to minimize changes in `api/documents.py`. The new `storage.py` is a thin orchestrator: it calls `db/session.py` for ORM operations and `storage/minio_backend.py` for object storage.

**New async signatures to match existing callers in `api/documents.py` (lines 32–57):**
```python
# Old (sync): storage.save_upload(content, file.filename, mime)
# New (async): await storage.save_upload(content, file.filename, mime)

# Old (sync): storage.save_metadata(meta)
# New (async): await storage.save_metadata(meta) — or merged into save_upload

# Old (sync): storage.list_metadata(topic=topic)
# New (async): await storage.list_metadata(topic=topic)

# Old (sync): storage.get_metadata(doc_id)
# New (async): await storage.get_metadata(doc_id)

# Old (sync): storage.delete_document(doc_id)
# New (async): await storage.delete_document(doc_id)
```

**Session injection pattern:** New `storage.py` functions accept an `AsyncSession` parameter (injected by the FastAPI dependency via `Depends(get_db)`) rather than creating their own. This mirrors how the classifier calls storage functions with state passed in.

**Error handling from current `storage.py`** (lines 34–38: return `None` for not-found, do not raise):
```python
def get_metadata(doc_id: str) -> dict | None:
    path = METADATA_DIR / f"{doc_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```
Keep the same `None`-on-not-found contract in the async ORM version so `api/documents.py` `if meta is None: raise HTTPException(404, ...)` checks continue to work unchanged.


---

### `backend/celery_app.py` (config, event-driven)

**Analog:** None in codebase.

**Pattern from RESEARCH.md Pattern 5** (lines 462–475):
```python
# backend/celery_app.py
import os
from celery import Celery

celery_app = Celery("docuvault")
celery_app.conf.broker_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")
celery_app.conf.result_backend = os.environ.get("REDIS_URL", "redis://redis:6379/0")
celery_app.conf.task_serializer = "json"
celery_app.conf.result_serializer = "json"
celery_app.conf.accept_content = ["json"]
celery_app.conf.task_routes = {
    "tasks.document_tasks.*": {"queue": "documents"},
}
```

**Critical:** Use `os.environ.get()` directly here, NOT `from config import settings`. `config.py` imports pydantic-settings, which may trigger FastAPI-related imports. Keep `celery_app.py` minimal to avoid Pitfall 7 (circular imports with the FastAPI app).


---

### `backend/tasks/document_tasks.py` (service, event-driven)

**Analog:** `backend/services/classifier.py` (lines 1–59) — same orchestration pattern (load metadata, load settings, call services, persist results).

**Orchestration pattern from `backend/services/classifier.py`** (lines 11–46):
```python
async def classify_document(doc_id: str, topic_names: list[str] | None = None) -> list[str]:
    meta = storage.get_metadata(doc_id)
    if meta is None:
        raise ValueError(f"Document {doc_id} not found")

    settings = storage.load_settings()
    provider = get_provider(settings)
    text = meta.get("extracted_text", "")
    result = await provider.classify(text[:MAX_AI_CHARS], topic_names, system_prompt)
    # ... persist results
    storage.update_document_topics(doc_id, final_topics)
    return final_topics
```

**Apply same orchestration structure** for the Celery task, with three critical differences:
1. Task function must be `def`, not `async def` (Celery workers have no asyncio event loop)
2. Import services directly — never import from `main.py` or any router module
3. Use `asyncio.run()` to call async service functions if unavoidable

```python
# backend/tasks/document_tasks.py
from celery_app import celery_app

@celery_app.task(name="tasks.document_tasks.extract_and_classify")
def extract_and_classify(document_id: str) -> dict:
    import asyncio
    from services import extractor, classifier
    # ... call services, persist results
    return {"document_id": document_id, "status": "classified"}
```

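
The third difference (bridging from a synchronous task body into async service code) can be sketched without Celery. In the example below the task decorator is omitted, `classify_document` is a hypothetical async stand-in rather than the real classifier, and the extra `topics` key in the return value is an assumption added for illustration.

```python
import asyncio


async def classify_document(document_id: str) -> list[str]:
    """Hypothetical async service function; the real one would call an AI provider."""
    await asyncio.sleep(0)  # placeholder for awaited I/O
    return ["invoices"]


def extract_and_classify(document_id: str) -> dict:
    # Plain `def`: the task body is synchronous. asyncio.run() creates a fresh
    # event loop, runs the coroutine to completion, and closes the loop again.
    topics = asyncio.run(classify_document(document_id))
    # Return only JSON-serializable values, matching task_serializer = "json".
    return {"document_id": document_id, "status": "classified", "topics": topics}


result = extract_and_classify("doc-1")
```
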

**Replace in `api/documents.py`** (lines 49–56):
```python
# Old:
if auto_classify:
    topics = await classifier.classify_document(saved["id"])
# New:
from tasks.document_tasks import extract_and_classify
extract_and_classify.delay(str(saved_doc.id))
```

---

### `backend/api/documents.py` (controller, request-response)

**Analog:** `backend/api/documents.py` (current, lines 1–102) — update in-place.

**Existing route structure to preserve** (lines 21–58):
- `@router.post("/upload")` — keep signature `(file: UploadFile, auto_classify: bool)`
- `@router.get("")` — keep pagination params `(topic, page, per_page)`
- `@router.get("/{doc_id}")` — keep path param
- `@router.delete("/{doc_id}")` — keep path param
- `@router.post("/{doc_id}/classify")` — keep path param + body

**Session injection change — current** (lines 1–4):
```python
from services import storage, extractor, classifier
```
**New** — add session dependency:
```python
from fastapi import APIRouter, UploadFile, File, Form, HTTPException, Query, Depends
from sqlalchemy.ext.asyncio import AsyncSession
from deps.db import get_db
from services import storage, extractor
from tasks.document_tasks import extract_and_classify
```

**Add `session` parameter to route handlers:**
```python
@router.post("/upload")
async def upload_document(
    file: UploadFile = File(...),
    auto_classify: bool = Form(True),
    session: AsyncSession = Depends(get_db),  # NEW
):
```

**Error handling pattern** (lines 50–56 — keep unchanged):
```python
try:
    topics = await classifier.classify_document(saved["id"])
    meta["topics"] = topics
except Exception as e:
    meta["classification_error"] = str(e)  # classification failure is non-fatal
```

**HTTP error pattern** (lines 75–77 — keep unchanged):
```python
if meta is None:
    raise HTTPException(404, "Document not found")
```

---

### `backend/api/topics.py` (controller, request-response)

**Analog:** `backend/api/topics.py` (current, lines 1–73) — update in-place.

**Existing Pydantic model pattern** (lines 8–19):
```python
class TopicCreate(BaseModel):
    name: str
    description: str = ""
    color: str = "#6366f1"

class TopicUpdate(BaseModel):
    name: str | None = None
    description: str | None = None
    color: str | None = None
```
Keep these models unchanged — they match the PostgreSQL `topics` table columns.

**Storage call pattern** (lines 26–30):
```python
@router.get("")
async def list_topics():
    topics = storage.load_topics()
    counts = storage.topic_doc_counts()
```
Update to inject `session: AsyncSession = Depends(get_db)` and call async ORM queries instead of flat-file storage functions. Response shape must remain identical (`{"topics": [...]}` with `doc_count` appended per topic).

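
The response shape that must be preserved can be pinned down with a small pure function that is independent of where the topics and counts come from. This is a sketch under assumptions: `build_topics_response` is a hypothetical helper, and keying the counts by topic name is a guess about how the per-topic counts are represented.

```python
def build_topics_response(topics: list[dict], counts: dict[str, int]) -> dict:
    """Append doc_count to each topic; topics with no documents get 0."""
    return {
        "topics": [{**t, "doc_count": counts.get(t["name"], 0)} for t in topics]
    }


# Illustrative rows; in the ORM version these would come from async queries.
topics = [
    {"name": "invoices", "description": "", "color": "#6366f1"},
    {"name": "contracts", "description": "", "color": "#6366f1"},
]
counts = {"invoices": 3}  # e.g. the result of a grouped count query

response = build_topics_response(topics, counts)
```
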

---

### `backend/requirements.txt` (config)

**Analog:** `backend/requirements.txt` (current, lines 1–16)

**Current file** (lines 1–16):
```
fastapi>=0.111
uvicorn[standard]>=0.29
python-multipart
pydantic-settings>=2.2
anthropic>=0.26
openai>=1.30
PyMuPDF>=1.24
python-docx>=1.1
pytesseract>=0.3
Pillow>=10.3
filelock>=3.14 # REMOVE — replaced by PostgreSQL transactions
aiofiles>=23.2
httpx>=0.27
pytest>=8.2
pytest-asyncio>=0.23
```

**Additions (append to file):**
```
sqlalchemy[asyncio]>=2.0
psycopg[binary]>=3.3
alembic>=1.13
minio>=7.2
celery[redis]>=5.4
redis>=7.0
```

**Remove:** `filelock>=3.14` — no longer needed once `services/storage.py` is replaced (RESEARCH.md line 952).

---

### `.env.example` (config)

**Analog:** `.env.example` (current, lines 1–6)

**Current file** (lines 1–6):
```bash
# Copy to .env and fill in as needed.
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
```

**Extend with all Phase 1 vars** (D-11, D-13, D-15, D-16). Keep existing vars at top. Pattern: group by service, comment each variable:
```bash
# ── PostgreSQL ───────────────────────────────────────────────────────────────
# App user (restricted: SELECT/INSERT/UPDATE/DELETE only — used by FastAPI + Celery)
DATABASE_URL=postgresql+psycopg://docuvault_app:changeme@postgres:5432/docuvault
# Migration user (DDL privileges — used ONLY by Alembic, never by the app at runtime)
DATABASE_MIGRATE_URL=postgresql+psycopg://docuvault_migrate:changeme@postgres:5432/docuvault
# Superuser password for the postgres init container (used only by initdb.d scripts)
POSTGRES_PASSWORD=changeme

# ── MinIO ────────────────────────────────────────────────────────────────────
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=changeme
MINIO_ENDPOINT=minio:9000
# App-level access key (minimal permissions: read/write on docuvault bucket only)
MINIO_ACCESS_KEY=docuvault_app
MINIO_SECRET_KEY=changeme
MINIO_BUCKET=docuvault

# ── Redis ────────────────────────────────────────────────────────────────────
REDIS_PASSWORD=changeme
REDIS_URL=redis://:changeme@redis:6379/0

# ── Security (Phase 2) ───────────────────────────────────────────────────────
# Not read by the app in Phase 1. Documented here for Phase 2 JWT + HKDF use.
SECRET_KEY=CHANGEME-replace-with-64-char-random-hex
```

---

### `backend/tests/conftest.py` (test, CRUD)

**Analog:** `backend/tests/conftest.py` (current, lines 1–71) — update in-place.

**Current fixture pattern** (lines 13–43):
```python
@pytest.fixture(autouse=True)
def isolated_data_dir(monkeypatch, tmp_path):
    """Each test gets its own clean data directory."""
    data_dir = tmp_path / "data"
    ...
    monkeypatch.setenv("DATA_DIR", str(data_dir))
    import config
    monkeypatch.setattr(config, "DATA_DIR", data_dir)
    ...
    yield data_dir
```

**New async session fixture**: replace `isolated_data_dir` with an async SQLite in-memory engine for unit tests, and keep a separate fixture for integration tests using the real Docker database. Copy the `yield` + teardown structure exactly:
```python
import pytest
import pytest_asyncio
from httpx import AsyncClient, ASGITransport
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker, AsyncSession
from sqlalchemy.pool import StaticPool
from db.models import Base
from deps.db import get_db
from main import app

@pytest_asyncio.fixture
async def db_session():
    """In-memory async SQLite session for unit tests."""
    engine = create_async_engine(
        "sqlite+aiosqlite:///:memory:",
        connect_args={"check_same_thread": False},
        poolclass=StaticPool,
    )
    async with engine.begin() as conn:
        await conn.run_sync(Base.metadata.create_all)

    AsyncTestSession = async_sessionmaker(engine, expire_on_commit=False)
    async with AsyncTestSession() as session:
        yield session

    await engine.dispose()

@pytest_asyncio.fixture
async def client(db_session):
    """Async test client with DB dependency overridden."""
    app.dependency_overrides[get_db] = lambda: db_session
    async with AsyncClient(transport=ASGITransport(app=app), base_url="http://test") as c:
        yield c
    app.dependency_overrides.clear()
```

Note: `aiosqlite` must be added to `requirements.txt` for tests. Because `db/models.py` imports PostgreSQL-specific column types (`INET`, `JSONB`), `Base.metadata.create_all` may fail to compile on SQLite for any table that uses them. If that happens, point the tests at the real PostgreSQL test database via the `DATABASE_URL` env var, as the integration tests do.

---

### `backend/tests/test_health.py` (test, request-response)
|
||
|
||
**Analog:** `backend/tests/test_health.py` (current, lines 1–5) — update in-place.
|
||
|
||
**Current test** (lines 1–5):
|
||
```python
|
||
def test_health(client):
|
||
resp = client.get("/health")
|
||
assert resp.status_code == 200
|
||
assert resp.json() == {"status": "ok"}
|
||
```
|
||
|
||
**Extended pattern:** keep the existing status assertions (the test is renamed `test_health_ok`) and add a second test for the richer response shape. Use the `async/await` style required by `pytest-asyncio`:
```python
import pytest

async def test_health_ok(client):
    resp = await client.get("/health")
    assert resp.status_code == 200
    data = resp.json()
    assert data["status"] == "ok"

async def test_health_checks_postgres_and_minio(client):
    resp = await client.get("/health")
    data = resp.json()
    assert "checks" in data
    assert "postgres" in data["checks"]
    assert "minio" in data["checks"]
    assert data["checks"]["postgres"] == "ok"
    assert data["checks"]["minio"] == "ok"
```

---

### `backend/tests/test_documents.py` (test, CRUD)

**Analog:** `backend/tests/test_documents.py` (current, lines 1–108); port to async.

**Current sync pattern** (lines 1–14):
```python
def test_upload_txt_no_classify(client, sample_txt):
    with open(sample_txt, "rb") as f:
        resp = client.post(
            "/api/documents/upload",
            files={"file": ("sample.txt", f, "text/plain")},
            data={"auto_classify": "false"},
        )
    assert resp.status_code == 200
```

**Port to async** by changing `def` to `async def` and `client.post` to `await client.post`:
```python
async def test_upload_txt_no_classify(client, sample_txt):
    with open(sample_txt, "rb") as f:
        resp = await client.post(
            "/api/documents/upload",
            files={"file": ("sample.txt", f, "text/plain")},
            data={"auto_classify": "false"},
        )
    assert resp.status_code == 200
    data = resp.json()
    assert data["original_name"] == "sample.txt"
```

Keep all assertion logic from the current file; only the `def`→`async def` and `client.verb()`→`await client.verb()` changes are needed. Add new tests for STORE-01 and STORE-02 requirements.

---

### `backend/tests/test_storage.py` (test, file-I/O)

**Analog:** None in codebase; new file.

**Pattern from RESEARCH.md Validation section** (lines 1022–1028) and the MinIO key schema (D-06):
```python
import pytest
import re
import uuid

async def test_object_key_schema(db_session):
    """STORE-02: MinIO object key must match {user_id}/{document_id}/{uuid4}{ext}."""
    from storage.minio_backend import MinIOBackend
    # Use a mock or capture the key returned by put_object.
    # The pattern below expects 36-character UUID strings for user_id and document_id.
    key = f"{uuid.uuid4()}/{uuid.uuid4()}/{uuid.uuid4()}.pdf"
    pattern = re.compile(
        r'^[0-9a-f-]{36}/[0-9a-f-]{36}/[0-9a-f-]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.[a-z]+$'
    )
    assert pattern.match(key)

async def test_filename_not_in_object_key():
    """STORE-02: Human-readable filename must NOT appear in the MinIO object key."""
    original_name = "invoice_Q3_2025.pdf"
    # The key returned by MinIOBackend.put_object must not contain the original name
    from storage.minio_backend import MinIOBackend
    # ... call with mock Minio client, bind the returned key to generated_key
    assert original_name not in generated_key
```

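The key schema above can be exercised without MinIO at all. The following is a minimal, stdlib-only sketch; `build_object_key` is a hypothetical helper (not part of `MinIOBackend`), and it assumes both `user_id` and `document_id` are UUIDs, which is what the regex in the test implies.

```python
import re
import uuid
from pathlib import PurePosixPath

# Canonical lowercase UUID string, e.g. 3f2b8c1e-....
UUID_RE = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# {user_id}/{document_id}/{uuid4}{ext}
KEY_PATTERN = re.compile(rf"^{UUID_RE}/{UUID_RE}/{UUID_RE}\.[a-z0-9]+$")


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, original_name: str) -> str:
    """Hypothetical helper: keep only the extension of the uploaded filename."""
    ext = PurePosixPath(original_name).suffix.lower()
    # A fresh uuid4 replaces the human-readable stem of the filename.
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


key = build_object_key(uuid.uuid4(), uuid.uuid4(), "invoice_Q3_2025.pdf")
```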
---

### `docker/postgres/initdb.d/01-init-users.sql` (config, batch)

**Analog:** None in codebase.

**Pattern from RESEARCH.md Pattern 7** (lines 581–599):
```sql
-- docker/postgres/initdb.d/01-init-users.sql
-- Runs as the POSTGRES_USER superuser on first container start only.

-- Migration user: DDL privileges (CREATE TABLE, ALTER TABLE, CREATE INDEX)
CREATE USER docuvault_migrate WITH PASSWORD 'PLACEHOLDER_MIGRATE_PASSWORD';
GRANT ALL PRIVILEGES ON DATABASE docuvault TO docuvault_migrate;

-- App user: runtime DML only (SELECT, INSERT, UPDATE, DELETE)
CREATE USER docuvault_app WITH PASSWORD 'PLACEHOLDER_APP_PASSWORD';
GRANT CONNECT ON DATABASE docuvault TO docuvault_app;
```

**Important:** The passwords in this script are placeholders. The real values live in `.env` and reach the services through `docker-compose.yml` environment variables. The init script runs only once, on an empty data volume, and a plain `.sql` init script cannot read environment variables, so the passwords must be hardcoded here and must match the values in `.env`.

The `ALTER DEFAULT PRIVILEGES` grant (for future tables created by Alembic) must be run inside the first Alembic migration (`0001_initial_schema.py`) using `op.execute()`, not in this init script; see RESEARCH.md Pattern 7 (lines 601–603) and Pitfall 4.

---

### `backend/alembic.ini` and `backend/migrations/env.py` (config, batch)

**Analog:** None in codebase.

**`alembic.ini` key section** (from RESEARCH.md Pattern 2, lines 328–334):
```ini
[alembic]
script_location = migrations
sqlalchemy.url = %(DATABASE_MIGRATE_URL)s
```

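One caveat worth verifying before relying on the `%(DATABASE_MIGRATE_URL)s` reference: Alembic parses `alembic.ini` with Python's `configparser`, and `%(NAME)s` is ini-level interpolation, which does not consult environment variables by itself. The stdlib-only sketch below illustrates that behaviour. The URL is a made-up placeholder, and passing it through `defaults` is only one way to resolve the reference; setting `sqlalchemy.url` from `env.py` with `config.set_main_option()` is another.

```python
import configparser

INI_TEXT = """
[alembic]
script_location = migrations
sqlalchemy.url = %(DATABASE_MIGRATE_URL)s
"""


def load_url(defaults=None):
    """Parse the ini text and return the interpolated sqlalchemy.url value."""
    parser = configparser.ConfigParser(defaults=defaults)
    parser.read_string(INI_TEXT)
    return parser.get("alembic", "sqlalchemy.url")


# Without a value for the name, interpolation fails when the option is read.
try:
    load_url()
    unresolved = False
except configparser.InterpolationMissingOptionError:
    unresolved = True

# Placeholder URL for illustration only; the real one would come from .env.
placeholder_url = "postgresql://user:password@host/dbname"
resolved = load_url({"DATABASE_MIGRATE_URL": placeholder_url})
```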
**`migrations/env.py` async pattern** (from RESEARCH.md Pattern 2, lines 300–327):
```python
import asyncio
from sqlalchemy.ext.asyncio import async_engine_from_config
from sqlalchemy import pool
from alembic import context
from db.models import Base  # noqa: F401 - must import to register all models

# Alembic Config object; gives access to the values in alembic.ini
config = context.config

target_metadata = Base.metadata

def do_run_migrations(connection):
    context.configure(connection=connection, target_metadata=target_metadata)
    with context.begin_transaction():
        context.run_migrations()

async def run_async_migrations():
    connectable = async_engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    async with connectable.connect() as connection:
        await connection.run_sync(do_run_migrations)
    await connectable.dispose()

def run_migrations_online():
    asyncio.run(run_async_migrations())
```

Generate the base file with `alembic init -t async migrations`, which produces this structure. Then add the `from db.models import Base` import and set `target_metadata = Base.metadata`.

---

## Shared Patterns

### Async/Await Convention
**Source:** `backend/main.py` (lines 10–13), `backend/api/documents.py` (lines 21–58)
**Apply to:** All new `db/`, `deps/`, `storage/`, `services/`, `tasks/` modules, all test files

All new code is `async def`. Synchronous SDK calls (MinIO) use `asyncio.to_thread()`. Celery task functions are the only exception: they must be plain `def` (see RESEARCH.md Pitfall: Celery tasks are synchronous).

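As a rough illustration of the `asyncio.to_thread()` rule, the sketch below wraps a blocking call so it does not stall the event loop. `blocking_put_object` is a stand-in for a synchronous SDK call, not the real MinIO client API, and the async wrapper name is hypothetical.

```python
import asyncio
import time


def blocking_put_object(key: str, data: bytes) -> str:
    """Stand-in for a synchronous SDK call (assumption, not the MinIO API)."""
    time.sleep(0.05)  # simulate network latency
    return key


async def put_object(key: str, data: bytes) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_put_object, key, data)


async def main() -> list:
    # Both uploads overlap because each one runs in its own thread.
    return await asyncio.gather(
        put_object("a", b"first"),
        put_object("b", b"second"),
    )


results = asyncio.run(main())
```

`asyncio.gather` returns results in the order the awaitables were passed, so `results` lines up with the two calls regardless of which thread finishes first.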
### None-on-not-found Contract
**Source:** `backend/services/storage.py` (lines 34–38)
**Apply to:** `backend/services/storage.py` (rewritten), `backend/db/` query helpers

```python
def get_metadata(doc_id: str) -> dict | None:
    ...
    if not path.exists():
        return None
```
Async ORM equivalent:
```python
async def get_document(session: AsyncSession, doc_id: uuid.UUID) -> Document | None:
    return await session.get(Document, doc_id)
```
Return `None` for not-found; let the API layer raise `HTTPException(404)`. Never raise exceptions from the service layer for expected missing-resource conditions.

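The split of responsibilities can be sketched end to end without FastAPI or a database. Everything here is illustrative: `HTTPError` stands in for `HTTPException`, the dictionary stands in for the `Document` table, and `read_document` is a hypothetical route handler.

```python
import asyncio
import uuid
from typing import Optional


class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch has no dependencies."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


# Hypothetical in-memory table standing in for Document rows
KNOWN_ID = uuid.uuid4()
_DOCUMENTS = {KNOWN_ID: {"id": str(KNOWN_ID), "original_name": "sample.txt"}}


async def get_document(doc_id: uuid.UUID) -> Optional[dict]:
    """Service layer: a missing row is an expected outcome, so return None."""
    return _DOCUMENTS.get(doc_id)


async def read_document(doc_id: uuid.UUID) -> dict:
    """API layer: translate None into a 404 error."""
    doc = await get_document(doc_id)
    if doc is None:
        raise HTTPError(404, "Document not found")
    return doc
```

Keeping the 404 decision in the handler means the same service function can be reused from a Celery task or a test without catching HTTP-specific exceptions.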
### HTTP Error Pattern
**Source:** `backend/api/documents.py` (lines 74–77), `backend/api/topics.py` (lines 57–59)
**Apply to:** All API route handlers
```python
if meta is None:
    raise HTTPException(404, "Document not found")
```
Use bare string messages (no `detail=` keyword), consistent with existing code.

### Classification Failure Non-Fatal Pattern
**Source:** `backend/api/documents.py` (lines 50–56)
**Apply to:** `backend/api/documents.py` (updated upload handler)
```python
try:
    topics = await classifier.classify_document(saved["id"])
    meta["topics"] = topics
except Exception as e:
    meta["classification_error"] = str(e)  # classification failure is non-fatal
```
Document upload succeeds even if classification fails. Celery task failure equivalent: task enters FAILURE state but the document row remains with `status="pending"`.

### ABC + Factory Pattern
**Source:** `backend/ai/base.py` + `backend/ai/__init__.py` (lines 1–36)
**Apply to:** `backend/storage/base.py` + `backend/storage/__init__.py`

This is the project's established pattern for pluggable backends. Follow it exactly: separate `base.py` (ABC), `__init__.py` (factory function `get_X_backend()`), concrete implementations in separate modules.

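For orientation, the shape of that pattern might look like the sketch below, collapsed into one file so it runs on its own. The class names, method signatures, registry, and `get_storage_backend()` are assumptions for illustration; the real interface should be copied from `backend/ai/base.py` and `backend/ai/__init__.py`.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class StorageBackend(ABC):
    """Would live in base.py: the interface every backend implements."""

    @abstractmethod
    async def put_object(self, key: str, data: bytes) -> str: ...

    @abstractmethod
    async def get_object(self, key: str) -> Optional[bytes]: ...


class InMemoryBackend(StorageBackend):
    """Would live in its own module: a concrete (here, toy) implementation."""

    def __init__(self) -> None:
        self._objects = {}

    async def put_object(self, key: str, data: bytes) -> str:
        self._objects[key] = data
        return key

    async def get_object(self, key: str) -> Optional[bytes]:
        return self._objects.get(key)  # None when the key is unknown


# Would live in __init__.py: registry plus factory function
_BACKENDS = {"memory": InMemoryBackend}


def get_storage_backend(name: str = "memory") -> StorageBackend:
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unknown storage backend: {name}") from None
```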
---

## No Analog Found

Files with no close match in the codebase (planner should use RESEARCH.md patterns instead):

| File | Role | Data Flow | Reason |
|------|------|-----------|--------|
| `docker/postgres/initdb.d/01-init-users.sql` | config | batch | No SQL scripts exist in codebase; use RESEARCH.md Pattern 7 |
| `backend/celery_app.py` | config | event-driven | No task queue code exists; use RESEARCH.md Pattern 5 |
| `backend/alembic.ini` | config | batch | No Alembic config exists; generate with `alembic init -t async` |
| `backend/migrations/env.py` | config | batch | No migrations exist; use `alembic init -t async` output + RESEARCH.md Pattern 2 |
| `backend/migrations/versions/0001_initial_schema.py` | migration | batch | No migrations exist; use full schema from RESEARCH.md Code Examples (lines 769–908) |
| `backend/tests/test_storage.py` | test | file-I/O | No object storage tests exist; new file per RESEARCH.md Validation section |

---

## Metadata

**Analog search scope:** `backend/` (all `.py` files), `docker-compose.yml`, `.env.example`, `backend/requirements.txt`, `backend/Dockerfile`
**Files scanned:** 25
**Pattern extraction date:** 2026-05-21