# Architecture Research

**Domain:** Multi-user SaaS document management platform (FastAPI + Vue 3 brownfield migration)
**Researched:** 2026-05-21
**Confidence:** HIGH (auth DI pattern confirmed via official FastAPI docs; storage/DB patterns are well-established S3/PostgreSQL engineering standards cross-verified against official MinIO and SQLAlchemy docs)

---

## Standard Architecture

### System Overview

```
┌──────────────────────────────────────────────────────────────────────┐
│                         Browser (Vue 3 SPA)                          │
│  ┌───────────┐  ┌──────────────┐  ┌───────────┐  ┌──────────────┐    │
│  │ auth store│  │  docs store  │  │quota store│  │settings store│    │
│  └─────┬─────┘  └──────┬───────┘  └─────┬─────┘  └──────┬───────┘    │
│        └───────────────┴────────────────┴───────────────┘            │
│                api/client.js (Bearer token injected)                 │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │ HTTPS/JSON + multipart
                          ┌─────────▼─────────┐
                          │   Load Balancer   │  (future; optional now)
                          └─────────┬─────────┘
                ┌───────────────────┼───────────────────┐
       ┌────────▼───────┐  ┌────────▼───────┐  ┌────────▼───────┐
       │ FastAPI inst 1 │  │ FastAPI inst 2 │  │ FastAPI inst N │
       │  (stateless)   │  │  (stateless)   │  │  (stateless)   │
       └────────┬───────┘  └────────┬───────┘  └────────┬───────┘
                └───────────────────┼───────────────────┘
                          ┌─────────▼─────────┐
                          │  Shared Services  │
                          └─────────┬─────────┘
            ┌───────────────────────┴─────────────────────┐
 ┌──────────▼───────────┐                     ┌───────────▼───────────┐
 │      PostgreSQL      │                     │         MinIO         │
 │ (users, docs, meta,  │                     │  (object storage,     │
 │  quotas, audit)      │                     │   single bucket,      │
 └──────────────────────┘                     │   prefix-per-user)    │
                                              └───────────┬───────────┘
            ┌───────────────────────┬─────────────────────┤
  ┌─────────▼─────────┐  ┌──────────▼─────────┐  ┌────────▼──┐
  │   Cloud Storage   │  │  OneDrive Adapter  │  │  WebDAV   │
  │  Adapter (base)   │  │    Google Drive    │  │  Adapter  │
  └───────────────────┘  └────────────────────┘  └───────────┘
```

---

## Component Boundaries

| Component | Responsibility | Communicates With |
|-----------|---------------|-------------------|
| `api/auth.py` | Registration, login, token refresh, TOTP enroll/verify | `services/user_service.py`, DB |
| `api/documents.py` | Upload, list, get, delete, reclassify, share | `services/document_service.py`, quota dep |
| `api/folders.py` | Folder CRUD, move | `services/folder_service.py` |
| `api/storage_backends.py` | Connect/disconnect cloud accounts, list/browse | `services/cloud_service.py` |
| `api/admin.py` | User CRUD, quota adjustments, audit log, AI config | `services/admin_service.py` |
| `deps/auth.py` | `get_current_user`: verifies JWT, returns `User` model | DB, `jose`/`PyJWT` |
| `deps/quota.py` | `check_upload_quota`: reads the user's usage, raises 413 if exceeded | DB |
| `deps/db.py` | `get_db`: yields async SQLAlchemy session | PostgreSQL |
| `services/document_service.py` | Orchestrates extract → classify → store flow | `extractor`, `classifier`, `storage_service` |
| `services/storage_service.py` | Routes to MinIO or cloud adapter; enforces object key namespacing | MinIO, cloud adapters |
| `services/user_service.py` | Password hashing, TOTP provisioning, breach check | DB, `bcrypt`, `pyotp` |
| `services/quota_service.py` | Compute used bytes from DB, update after upload/delete | DB |
| `services/audit_service.py` | Append-only audit log writes | DB |
| `services/cloud_service.py` | Manage encrypted cloud credentials, proxy operations | Cloud adapters, DB |
| `storage/base.py` | `StorageBackend` ABC (mirrors `ai/base.py` pattern) | (none) |
| `storage/minio_backend.py` | MinIO S3 implementation | MinIO |
| `storage/onedrive_backend.py` | OneDrive Graph API implementation | Microsoft Graph |
| `storage/gdrive_backend.py` | Google Drive API implementation | Google Drive API |
| `storage/nextcloud_backend.py` | Nextcloud WebDAV implementation | WebDAV |
| `db/models.py` | SQLAlchemy ORM models | PostgreSQL |
| `db/migrations/` | Alembic migration history | (none) |

---

## Recommended Project Structure

```
backend/
├── main.py                    # FastAPI app factory, middleware, router registration
├── config.py                  # pydantic-settings: DB URL, MinIO creds, secret keys
├── deps/
│   ├── auth.py                # get_current_user, get_current_admin
│   ├── db.py                  # get_db (async session dependency)
│   └── quota.py               # check_upload_quota (raises 413 if exceeded)
├── api/
│   ├── auth.py                # /auth/register, /auth/login, /auth/refresh, /auth/totp/*
│   ├── documents.py           # /documents/* (existing routes, now user-scoped)
│   ├── folders.py             # /folders/*
│   ├── storage_backends.py    # /storage-backends/* (cloud account management)
│   └── admin.py               # /admin/* (users, quotas, audit, AI config)
├── services/
│   ├── document_service.py    # upload orchestration (extract → classify → store → quota)
│   ├── storage_service.py     # routes uploads to correct StorageBackend
│   ├── quota_service.py       # read/write quota usage
│   ├── user_service.py        # user creation, password, TOTP
│   ├── audit_service.py       # audit log writes
│   └── cloud_service.py       # cloud backend credential management
├── storage/                   # cloud storage adapter layer (mirrors ai/)
│   ├── base.py                # StorageBackend ABC
│   ├── __init__.py            # get_storage_backend() factory
│   ├── minio_backend.py       # default local-S3 backend
│   ├── onedrive_backend.py
│   ├── gdrive_backend.py
│   └── nextcloud_backend.py   # WebDAV-based
├── ai/                        # unchanged: existing provider abstraction
│   └── ...
├── db/
│   ├── models.py              # all SQLAlchemy ORM models
│   ├── session.py             # async engine + sessionmaker
│   └── migrations/            # Alembic env + version scripts
└── tests/
```

### Structure Rationale

- **`deps/`:** FastAPI dependency functions isolated from service logic. Auth, DB session, and quota are injected independently, so routes compose them without coupling.
- **`storage/`:** Direct mirror of the `ai/` module: same ABC + factory pattern, so the team's existing mental model applies immediately.
- **`db/`:** ORM models and session config separated from services, so migrations can run independently of app startup.

---

## Architectural Patterns

### Pattern 1: JWT Verification via Dependency Injection (not middleware)

**What:** JWT parsing and user lookup happen in `deps/auth.py::get_current_user`, injected via `Depends()` per route.

**When to use:** All authenticated routes. Admin routes additionally inject `get_current_admin`, which calls `get_current_user` and then checks `user.role == "admin"`.

**Trade-offs:** Unauthenticated routes (health check, login, register) require no special exclusion logic. Middleware-based auth forces you to maintain an allowlist of public routes, and that list drifts as routes are added. DI is opt-in, so a route without the dependency is public; attach `Depends(get_current_user)` at the router level (`APIRouter(dependencies=[...])`) so an individual route cannot forget it.

**Example:**
```python
# deps/auth.py
from typing import Annotated

from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt  # python-jose; with PyJWT, catch jwt.PyJWTError instead
from sqlalchemy.ext.asyncio import AsyncSession

# settings, User, and get_db come from config.py, db/models.py, and deps/db.py.

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="/auth/login")

# One exception for every failure mode, so the response never reveals which check failed.
credentials_exception = HTTPException(
    status_code=status.HTTP_401_UNAUTHORIZED,
    detail="Could not validate credentials",
    headers={"WWW-Authenticate": "Bearer"},
)

async def get_current_user(
    token: Annotated[str, Depends(oauth2_scheme)],
    db: Annotated[AsyncSession, Depends(get_db)],
) -> User:
    try:
        # Verifies signature and expiry; raises JWTError on any problem.
        payload = jwt.decode(token, settings.jwt_secret, algorithms=["HS256"])
        user_id: str | None = payload.get("sub")
        if user_id is None:
            raise credentials_exception
    except JWTError:
        raise credentials_exception
    # Primary-key lookup on every request, so deactivated users lose access immediately.
    user = await db.get(User, user_id)
    if user is None or not user.is_active:
        raise credentials_exception
    return user

# api/documents.py
@router.get("/documents")
async def list_documents(
    current_user: Annotated[User, Depends(get_current_user)],
    db: Annotated[AsyncSession, Depends(get_db)],
):
    ...
```

**Confirmed:** HIGH confidence. FastAPI's official security tutorial implements authentication with `Depends()` in this way rather than with middleware.

---

### Pattern 2: Refresh Token Rotation

**What:** Short-lived access tokens (15 min) plus long-lived refresh tokens (30 days) whose hashes are stored in the `refresh_tokens` table. On every `/auth/refresh` call, the old refresh token is invalidated and a new pair is issued.

**When to use:** Always, for multi-user SaaS. It limits how long a stolen token remains usable.

**Trade-offs:** Requires a `refresh_tokens` DB table and one extra DB write per refresh. The alternative (long-lived JWTs) cannot be revoked without a blocklist, which has the same cost.

**Implementation notes:**
- Refresh token = opaque random UUID (not a JWT); store it hashed in the DB alongside `user_id`, `expires_at`, `revoked`
- Access token = JWT with `sub=user_id`, `exp=now+15m`, `jti=uuid` (for an optional future blocklist)
- On logout or password change: set `revoked=true` on all of the user's refresh tokens
- On TOTP failure after password success: do not issue any token; log a `failed_mfa` audit event

---

### Pattern 3: MinIO Presigned URL Flow (preferred over streaming proxy)

**What:** FastAPI generates a short-lived presigned PUT URL for MinIO; the browser uploads directly to MinIO. For downloads, FastAPI generates a presigned GET URL and redirects.

**When to use:** All document uploads and downloads where the browser can reach the MinIO endpoint named in the presigned URL (in a typical Docker Compose deployment, that means publishing the MinIO API port or routing it through the reverse proxy). Use a streaming proxy only when MinIO is reachable solely from the internal network.

**Trade-offs:**
- A presigned URL avoids buffering the file through FastAPI, which reduces memory pressure and latency for large files.
- FastAPI signs the URL and later verifies the uploaded object, so it needs MinIO credentials and connectivity, but it never handles the byte stream.
- For Docker Compose: FastAPI can talk to MinIO over the internal Docker network, but the presigned URL must name the host and port the browser will use. The host is part of the signature, so generate URLs with a client configured for the public MinIO endpoint; rewriting the host afterwards invalidates the signature.

**Flow:**
```
1. POST /documents/upload-url  (FastAPI, authenticated)
     → quota check
     → generate presigned PUT URL (expires 5 min)
     → return { upload_url, object_key, document_id }

2. PUT <upload_url>  (browser → MinIO directly)
     → no FastAPI involvement

3. POST /documents/{id}/confirm  (FastAPI, authenticated)
     → verify object exists in MinIO
     → increment quota usage (atomic, see Pattern 5)
     → trigger text extraction + classification (background task)
     → update document status to "processing"
     → return document record
```

**Object key namespace:** `{user_id}/{document_id}/{filename}` gives per-user isolation without separate buckets. One bucket (`docuvault-documents`) is sufficient; IAM policies or key prefix checks in `storage_service.py` enforce the isolation.

**Presigned GET for downloads:**
```python
from datetime import timedelta

from fastapi.responses import RedirectResponse

def download_redirect(user_id: str, document_id: str, filename: str) -> RedirectResponse:
    # minio_client is the shared minio.Minio instance created at startup.
    url = minio_client.presigned_get_object(
        bucket_name="docuvault-documents",
        object_name=f"{user_id}/{document_id}/{filename}",
        expires=timedelta(minutes=30),
    )
    # RedirectResponse defaults to 307; the download flow uses 302.
    return RedirectResponse(url, status_code=302)
```

**Confidence:** HIGH for the S3 presigned URL pattern (standard across all S3-compatible stores). MinIO Python SDK `presigned_put_object` and `presigned_get_object` methods confirmed as stable API.

---

### Pattern 4: Cloud Storage Adapter (StorageBackend ABC)

**What:** A `StorageBackend` ABC in `storage/base.py` defines the interface. Each cloud integration implements it. `storage_service.py` routes to the correct backend based on the user's `default_storage_backend` setting.

**When to use:** Any operation that reads or writes document bytes. The service layer never calls MinIO or Google Drive directly; it always goes through the adapter.

**Interface:**
```python
# storage/base.py
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    async def put_object(self, key: str, data: bytes, content_type: str) -> str:
        """Store object, return canonical reference (URL or key)."""

    @abstractmethod
    async def get_object(self, key: str) -> bytes:
        """Retrieve object bytes."""

    @abstractmethod
    async def delete_object(self, key: str) -> None:
        """Delete object."""

    @abstractmethod
    async def get_presigned_url(self, key: str, expires_seconds: int = 3600) -> str | None:
        """Return a time-limited direct URL, or None if backend doesn't support it."""

    @abstractmethod
    async def list_objects(self, prefix: str) -> list[str]:
        """List keys under prefix."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Verify connectivity."""
```

**Factory:**
```python
# storage/__init__.py
def get_storage_backend(user: User, credentials: dict | None) -> StorageBackend:
    backend_type = user.default_storage_backend  # "minio" | "onedrive" | "gdrive" | ...
    if backend_type == "minio":
        return MinIOBackend(settings.minio_endpoint, ...)
    elif backend_type == "onedrive":
        return OneDriveBackend(credentials)  # decrypted before passing in
    ...
```

**Credentials encryption:** Cloud OAuth tokens and refresh tokens are stored encrypted with Fernet symmetric encryption. The key is in the `CLOUD_CREDS_KEY` env var. Encryption and decryption happen in `cloud_service.py` before the credentials are passed to the backend constructor, so the adapter always receives plaintext credentials and never touches the DB.
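
One practical benefit of the adapter contract is that tests can swap in a backend with no network dependency. The sketch below is an assumption-laden illustration: it re-declares a condensed copy of the interface so it runs on its own, and `InMemoryBackend` is a hypothetical test double that does not exist in the project.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class StorageBackend(ABC):
    """Condensed copy of the interface above so this sketch is self-contained."""

    @abstractmethod
    async def put_object(self, key: str, data: bytes, content_type: str) -> str: ...

    @abstractmethod
    async def get_object(self, key: str) -> bytes: ...

    @abstractmethod
    async def delete_object(self, key: str) -> None: ...

    @abstractmethod
    async def get_presigned_url(self, key: str, expires_seconds: int = 3600) -> Optional[str]: ...

    @abstractmethod
    async def list_objects(self, prefix: str) -> List[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical test double: keeps objects in a dict instead of MinIO."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    async def put_object(self, key: str, data: bytes, content_type: str) -> str:
        self._objects[key] = data
        return key

    async def get_object(self, key: str) -> bytes:
        return self._objects[key]

    async def delete_object(self, key: str) -> None:
        self._objects.pop(key, None)

    async def get_presigned_url(self, key: str, expires_seconds: int = 3600) -> Optional[str]:
        return None  # no direct URL; callers must fall back to proxying bytes

    async def list_objects(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))

    async def health_check(self) -> bool:
        return True


async def demo() -> List[str]:
    backend: StorageBackend = InMemoryBackend()
    await backend.put_object("u1/d1/a.pdf", b"%PDF", "application/pdf")
    await backend.put_object("u2/d9/b.pdf", b"%PDF", "application/pdf")
    return await backend.list_objects("u1/")
```

Because `get_presigned_url` may return `None`, callers such as the download route need a proxy fallback for backends without direct URLs.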

---

### Pattern 5: Storage Quota Enforcement via Service Layer (not middleware, not DB constraint)

**What:** Quota is checked in `deps/quota.py::check_upload_quota`, a FastAPI dependency injected on upload routes. After a successful upload, `quota_service.increment_usage(user_id, bytes)` records the new usage.

**Where NOT to enforce:**
- **Not in middleware:** Middleware can see the `Content-Length` header but not who is uploading; it would have to re-implement auth to find the user's quota. With presigned uploads, the bytes never pass through FastAPI at all.
- **Not as a DB constraint:** `CHECK (used_bytes <= limit_bytes)` makes the DB reject the commit only after the object is already in MinIO, leaving an uploaded object with no committed metadata.

**Correct sequence:**
```
1. Pre-upload: deps/quota.py reads quotas.used_bytes and the declared upload size
   (Content-Length for proxied uploads, the `size` field for presigned uploads)
     → if (used + incoming) > limit_bytes: raise HTTP 413 with quota detail

2. Upload proceeds to MinIO (presigned URL or proxy)

3. Post-upload: quota_service.increment_usage atomically
   ($delta = object size reported by MinIO, not the client-declared size):
     UPDATE quotas SET used_bytes = used_bytes + $delta
     WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes
     RETURNING used_bytes
     → if no rows returned: another concurrent upload exceeded quota; delete from MinIO + 413
```

**Why atomic update with check:** Two simultaneous uploads can both pass the pre-check. The atomic UPDATE with a WHERE guard prevents double-spend: PostgreSQL serializes the two writes to the same row, so the second UPDATE re-evaluates the guard against the already-incremented value. This is the correct pattern for optimistic quota enforcement under concurrency.
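
The guarded UPDATE can be tried out locally. The sketch below uses the standard-library `sqlite3` module as a stand-in for PostgreSQL (an assumption made purely so the example runs without a server); the `make_db` helper is hypothetical, and the production statement would use `$1`-style parameters and `RETURNING` as shown above.

```python
import sqlite3


def make_db(limit_bytes: int) -> sqlite3.Connection:
    """Hypothetical helper: one-user quotas table in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotas ("
        " user_id TEXT PRIMARY KEY,"
        " limit_bytes INTEGER NOT NULL,"
        " used_bytes INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO quotas (user_id, limit_bytes) VALUES ('u1', ?)", (limit_bytes,))
    conn.commit()
    return conn


def increment_usage(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Add delta to used_bytes only if the result stays within the limit."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    # Zero affected rows means the guard failed: the caller should delete
    # the uploaded object and answer with HTTP 413.
    return cur.rowcount == 1


def used_bytes(conn: sqlite3.Connection, user_id: str) -> int:
    row = conn.execute("SELECT used_bytes FROM quotas WHERE user_id = ?", (user_id,)).fetchone()
    return row[0]
```

The check and the increment happen in a single statement, so there is no window between "read usage" and "write usage" for a second upload to slip through.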

---

## PostgreSQL Schema Design

### Core Tables

```sql
-- gen_random_uuid() is built in from PostgreSQL 13; earlier versions need the pgcrypto extension.

-- Users
CREATE TABLE users (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    handle        TEXT UNIQUE NOT NULL,               -- @username for sharing
    email         TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,                      -- bcrypt
    totp_secret   TEXT,                               -- NULL = TOTP not enabled
    totp_enabled  BOOLEAN NOT NULL DEFAULT FALSE,
    role          TEXT NOT NULL DEFAULT 'user',       -- 'user' | 'admin'
    is_active     BOOLEAN NOT NULL DEFAULT TRUE,
    ai_provider   TEXT,                               -- NULL = use system default
    ai_model      TEXT,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Quotas (1:1 with users; separate for clean admin queries)
CREATE TABLE quotas (
    user_id     UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    limit_bytes BIGINT NOT NULL DEFAULT 104857600,    -- 100 MiB
    used_bytes  BIGINT NOT NULL DEFAULT 0,
    CONSTRAINT no_negative_usage CHECK (used_bytes >= 0)
);

-- Refresh tokens
CREATE TABLE refresh_tokens (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id    UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    token_hash TEXT NOT NULL UNIQUE,                  -- SHA-256 of the opaque token
    expires_at TIMESTAMPTZ NOT NULL,
    revoked    BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON refresh_tokens(user_id, revoked);

-- Folders
CREATE TABLE folders (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id    UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    parent_id  UUID REFERENCES folders(id) ON DELETE CASCADE,   -- NULL = root
    name       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- NULLs compare as distinct, so this does not deduplicate root folders (parent_id IS NULL)
    UNIQUE (user_id, parent_id, name)
);

-- Documents
CREATE TABLE documents (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    folder_id       UUID REFERENCES folders(id) ON DELETE SET NULL,
    filename        TEXT NOT NULL,
    content_type    TEXT NOT NULL,
    size_bytes      BIGINT NOT NULL DEFAULT 0,
    storage_backend TEXT NOT NULL DEFAULT 'minio',    -- 'minio' | 'onedrive' | ...
    object_key      TEXT NOT NULL,                    -- backend-specific reference
    extracted_text  TEXT,                             -- NULL until extraction complete
    status          TEXT NOT NULL DEFAULT 'pending',  -- pending | processing | ready | error
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON documents(user_id, folder_id);
CREATE INDEX ON documents(user_id, created_at DESC);

-- Topics (per-user; admin sets defaults). Must exist before document_topics references it.
CREATE TABLE topics (
    id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,        -- NULL = system default
    name    TEXT NOT NULL,
    -- As with folders, rows with user_id IS NULL are not deduplicated by this constraint
    UNIQUE (user_id, name)
);

-- Document topics (M:N)
CREATE TABLE document_topics (
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    topic_id    UUID NOT NULL REFERENCES topics(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, topic_id)
);

-- Document shares
CREATE TABLE document_shares (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id  UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    owner_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    recipient_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    permission   TEXT NOT NULL DEFAULT 'view',        -- 'view' | 'download' (future)
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (document_id, recipient_id)
);
CREATE INDEX ON document_shares(recipient_id);

-- Cloud storage backends per user
CREATE TABLE cloud_backends (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id         UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    backend_type    TEXT NOT NULL,                    -- 'onedrive' | 'gdrive' | 'nextcloud' | 'webdav'
    display_name    TEXT NOT NULL,
    credentials_enc TEXT NOT NULL,                    -- Fernet-encrypted JSON blob
    is_default      BOOLEAN NOT NULL DEFAULT FALSE,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON cloud_backends(user_id);

-- Audit log (append-only)
CREATE TABLE audit_log (
    id          BIGSERIAL PRIMARY KEY,
    user_id     UUID REFERENCES users(id) ON DELETE SET NULL,
    actor_id    UUID REFERENCES users(id) ON DELETE SET NULL,   -- admin acting on behalf
    event_type  TEXT NOT NULL,    -- login | login_failed | upload | delete | share | quota_change | ...
    resource_id UUID,             -- document_id / folder_id / user_id depending on context
    ip_address  INET,
    -- Event-specific extra fields (no document content). 'metadata' is a reserved attribute
    -- name in SQLAlchemy Declarative models, so map this column to a differently named attribute.
    metadata    JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON audit_log(user_id, created_at DESC);
CREATE INDEX ON audit_log(event_type, created_at DESC);
-- NOTE: no UPDATE or DELETE grants on audit_log for app user; only INSERT + SELECT
```

**Schema design notes:**
- `topics.user_id IS NULL` = system-wide default topics visible to all users; per-user topics shadow them.
- `documents.object_key` stores the backend-relative reference: for MinIO it is `{user_id}/{document_id}/{filename}`; for OneDrive it is the Drive item ID. The `storage_backend` column tells the service which adapter to use.
- `cloud_backends.credentials_enc` is never returned in any API response; only the adapter factory decrypts it server-side.
- Audit log uses `BIGSERIAL` (not UUID) for append-ordered natural scan and to discourage random access patterns.

---

## Data Flow

### Document Upload Flow (MinIO presigned URL path)

```
Browser
  │
  ├─[1] POST /documents/upload-url  {filename, size, content_type, folder_id?}
  │       → get_current_user dep (JWT verify → load User from DB)
  │       → check_upload_quota dep (reads quotas table, compares size)
  │       → document_service.prepare_upload()
  │           → INSERT documents row (status='pending')
  │           → minio_backend.generate_presigned_put(object_key, expires=300s)
  │       ← {upload_url, object_key, document_id}
  │
  ├─[2] PUT <upload_url>  (browser → MinIO, no FastAPI)
  │
  ├─[3] POST /documents/{id}/confirm
  │       → get_current_user dep
  │       → document_service.confirm_upload()
  │           → verify object exists in MinIO (HEAD request) and read its actual size
  │           → quota_service.increment_usage(user_id, size_bytes)  [atomic]
  │           → UPDATE documents SET status='processing'
  │           → enqueue background task: extract_and_classify(document_id)
  │       ← {document_id, status: "processing"}
  │
  └─[4] Background: extract_and_classify(document_id)
          → extractor.extract_text(object bytes from MinIO)
          → classifier.classify(text, user_topics)
          → UPDATE documents SET extracted_text=..., status='ready'
          → INSERT document_topics rows
          → audit_service.log(event='upload', ...)
```

### Authentication Flow

```
Browser
  │
  ├─[1] POST /auth/login  {email, password, totp_code?}
  │       → user_service.verify_password(email, password)
  │       → if totp_enabled: pyotp.TOTP(secret).verify(totp_code)
  │       → issue access_token (JWT, 15 min) + refresh_token (opaque UUID)
  │       → store hash(refresh_token) in refresh_tokens table
  │       ← {access_token, refresh_token, expires_in}
  │
  ├─[2] Any authenticated request
  │       Authorization: Bearer <access_token>
  │       → get_current_user dep verifies the JWT signature locally,
  │         then loads the User by primary key (is_active check)
  │
  └─[3] POST /auth/refresh  {refresh_token}
          → look up hash(refresh_token) in refresh_tokens table
          → verify not revoked, not expired
          → set revoked=true on old token
          → issue new access_token + new refresh_token (rotation)
          ← {access_token, refresh_token, expires_in}
```
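
To make "verifies the JWT locally" concrete, here is a minimal standard-library sketch of HS256 signing and verification with the `sub`, `exp`, and `jti` claims from Pattern 2. It is for illustration only: the function names are assumptions, and the application itself should keep using `python-jose` or `PyJWT` rather than hand-rolled token code.

```python
import base64
import hashlib
import hmac
import json
import uuid
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_access_token(user_id: str, secret: bytes, now: int, ttl: int = 900) -> str:
    """Build header.claims.signature; ttl=900 s matches the 15-minute lifetime."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "exp": now + ttl, "jti": str(uuid.uuid4())}
    signing_input = (
        _b64url(json.dumps(header).encode()) + "." + _b64url(json.dumps(claims).encode())
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_access_token(token: str, secret: bytes, now: int) -> Optional[str]:
    """Return the user id if signature and expiry check out, else None. No DB access."""
    try:
        head_b64, claims_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(head_b64))
        expected = hmac.new(
            secret, f"{head_b64}.{claims_b64}".encode(), hashlib.sha256
        ).digest()
        if header.get("alg") != "HS256":
            return None
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
        if claims.get("exp", 0) <= now:
            return None
        return claims.get("sub")
    except (ValueError, TypeError, AttributeError):
        return None
```

Verification needs only the shared secret and the clock, which is why any FastAPI instance can validate a token without shared session state.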

### Shared Document Access Flow

```
Recipient accesses "Shared with me"
  │
  ├─ GET /documents/shared-with-me
  │     → SELECT d.* FROM documents d
  │       JOIN document_shares s ON s.document_id = d.id
  │       WHERE s.recipient_id = :current_user_id
  │     ← list of document records (owner's documents, recipient has view access)
  │
  └─ GET /documents/{id}/download  (recipient, shared document)
        → verify document_shares row exists for (document_id, current_user_id)
        → generate presigned GET URL using owner's object_key
        ← 302 redirect to presigned URL
          (file bytes flow from MinIO → browser, never through FastAPI)
```
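
The download authorization above reduces to one rule: the requester owns the document or holds a share row for it. A small sketch, assuming a hypothetical `Document` dataclass and a set of `(document_id, recipient_id)` pairs standing in for the `document_shares` table:

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical slice of a documents row."""
    id: str
    user_id: str      # owner
    object_key: str   # always under the owner's prefix


def can_download(doc: Document, requester_id: str, shares: Set[Tuple[str, str]]) -> bool:
    """Owner always may; anyone else needs a (document_id, recipient_id) share."""
    return doc.user_id == requester_id or (doc.id, requester_id) in shares
```

Note that the presigned URL is built from the owner's `object_key`, so the share check is the only thing standing between a recipient and another user's prefix.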

---

## Migration Path: Flat-File → PostgreSQL + MinIO

### Principle: parallel-run, not flag-day cutover

The safest approach is to keep the existing flat-file code running and introduce the new stack incrementally, in a sequence that never breaks the existing API contract from the Vue frontend's perspective.

### Phase 1: Infrastructure, no behavior change

1. Add PostgreSQL and MinIO services to `docker-compose.yml`
2. Create `db/models.py` with the initial schema (users, documents, quotas; no auth yet)
3. Add Alembic, run the initial migration
4. Add `deps/db.py` with the async session dependency
5. No API changes. Existing flat-file code still runs.

**Validation:** `docker-compose up` boots all services without errors. Alembic migrations apply cleanly.

### Phase 2: Auth layer (new endpoints, existing endpoints temporarily open)

1. Add the auth columns to `users` (password hash, TOTP fields) and create the `refresh_tokens` table
2. Implement `/auth/register`, `/auth/login`, `/auth/refresh`
3. Add the `get_current_user` dependency to `deps/auth.py`
4. Add a single authenticated test route (`GET /auth/me`)
5. Existing document endpoints remain unauthenticated (guarded by feature flag or separate router prefix)

**Validation:** Auth endpoints work independently. Existing UI still calls existing routes without tokens.

### Phase 3: Document storage migration (dual-write period)

1. Add MinIO `minio_backend.py`, integrate into `storage_service.py`
2. Create a one-time migration script:
   - Reads each `data/metadata/<id>.json`
   - Inserts a `documents` row with `user_id = SYSTEM_USER_ID` (a single placeholder user)
   - Uploads `data/uploads/<id>.<ext>` to MinIO
3. New uploads go to MinIO + PostgreSQL; old flat-file data has been migrated
4. Update document API routes to read from PostgreSQL + MinIO, guarded behind `get_current_user`

**Critical:** Run the migration script in a transaction; if any file fails, roll back the DB inserts (MinIO objects can be cleaned up separately). Do not delete flat-file data until validation is complete.
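
The all-or-nothing behaviour described above might look like the following sketch. Several things here are assumptions for the sake of a runnable example: `sqlite3` stands in for PostgreSQL, a plain dict stands in for the MinIO bucket, the `filename` metadata field is a guess at the flat-file format, and `SYSTEM_USER_ID` is a placeholder value.

```python
import json
import sqlite3
from pathlib import Path
from typing import Dict, List

SYSTEM_USER_ID = "00000000-0000-0000-0000-000000000000"  # placeholder value


def migrate_flat_files(data_dir: Path, conn: sqlite3.Connection, bucket: Dict[str, bytes]) -> int:
    """Copy data/metadata/<id>.json and data/uploads/<id>.<ext> into the new stores.

    Returns the number of migrated documents. On any failure, rolls back every
    DB insert and removes the objects uploaded so far, then re-raises.
    """
    uploaded: List[str] = []
    try:
        for meta_path in sorted((data_dir / "metadata").glob("*.json")):
            doc_id = meta_path.stem
            meta = json.loads(meta_path.read_text())
            matches = list((data_dir / "uploads").glob(f"{doc_id}.*"))
            if len(matches) != 1:
                raise FileNotFoundError(f"expected one upload for {doc_id}, found {len(matches)}")
            payload = matches[0].read_bytes()
            filename = meta.get("filename", matches[0].name)  # assumed metadata field
            key = f"{SYSTEM_USER_ID}/{doc_id}/{filename}"
            bucket[key] = payload  # stand-in for a MinIO put_object call
            uploaded.append(key)
            conn.execute(
                "INSERT INTO documents (id, user_id, filename, size_bytes, object_key) "
                "VALUES (?, ?, ?, ?, ?)",
                (doc_id, SYSTEM_USER_ID, filename, len(payload), key),
            )
        conn.commit()
        return len(uploaded)
    except Exception:
        conn.rollback()
        for key in uploaded:
            bucket.pop(key, None)
        raise
```

The flat files are only read, never deleted, so a failed run can simply be repeated after the offending file is fixed.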

### Phase 4: Multi-user isolation

1. Add per-user `quotas` rows, `folders` table
2. Enforce user-scoped queries: all document queries include `WHERE user_id = :current_user_id`
3. Add quota enforcement dependency on upload routes
4. Add `document_shares`, `cloud_backends` tables

### Phase 5: Cloud storage backends

1. Introduce the `StorageBackend` ABC and adapt the Phase 3 `MinIOBackend` to implement it
2. Implement the first cloud adapter (OneDrive or Google Drive)
3. Add `/storage-backends/*` API endpoints
4. Add frontend UI for connecting cloud accounts

### Frontend changes during migration

- Add `Authorization: Bearer <token>` header injection in `src/api/client.js` (single change point, since all API calls already go through this module)
- Add login/register views and an auth Pinia store
- Redirect to `/login` on 401 responses
- No other frontend changes required until Phase 4 (user-scoped UI)

---

## Horizontal Scaling Concerns

| Concern | What to Share | What Can Be Instance-Local |
|---------|--------------|---------------------------|
| DB connections | PostgreSQL (shared); use connection pooling (asyncpg pool size 10-20 per instance) | None |
| Object storage | MinIO (shared); all instances use the same endpoint | None |
| Refresh token state | PostgreSQL `refresh_tokens` table (shared) | JWT signature validation (CPU-only, no shared state needed) |
| Quota state | PostgreSQL `quotas` table with atomic UPDATE (shared) | Pre-flight size check (instance-local read, final write shared) |
| Background tasks | `BackgroundTasks` is per-process and cannot be shared across instances; use Celery + Redis or a Postgres-backed queue (pgqueuer, Procrastinate) | Single-instance: `BackgroundTasks` is fine for Phase 1 |
| File upload temp buffers | If streaming proxy pattern used: RAM per instance | Use presigned URLs to avoid this entirely |
| AI provider instances | Re-instantiated per request already; no shared state | Per-instance re-instantiation is fine |
| CORS / session | Stateless JWT; no sticky sessions needed | None |

**First bottleneck:** Background task queue. FastAPI's `BackgroundTasks` runs in the same process as the request handlers. When classification is slow or multiple uploads arrive simultaneously, workers block, and queued tasks are lost if the process restarts. Introduce a task queue (Celery + Redis, or pgqueuer) before scaling to N instances; otherwise each instance has its own queue and tasks are not distributed.

**Second bottleneck:** DB connection count. N instances × 20 connections each means N×20 PostgreSQL connections. Add PgBouncer in transaction mode in front of PostgreSQL before N gets large; asyncpg's prepared-statement cache generally has to be disabled (`statement_cache_size=0`) in that mode.
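
As a rough worked example of the N×20 arithmetic: PostgreSQL ships with `max_connections = 100`, of which 3 are reserved for superusers by default, so the budget runs out after only a handful of instances. The helper names below are illustrative.

```python
def total_connections(instances: int, pool_size: int) -> int:
    """Connections PostgreSQL sees when every instance fills its pool."""
    return instances * pool_size


def max_instances(max_connections: int = 100, pool_size: int = 20, reserved: int = 3) -> int:
    """How many instances fit before the connection limit is exhausted."""
    return (max_connections - reserved) // pool_size
```

With the defaults, `max_instances()` evaluates to 4 and a fifth instance would push `total_connections(5, 20)` to 100, past the 97 usable slots. That is the point at which PgBouncer (or a smaller per-instance pool) becomes necessary.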
|
||
|
||
---
|
||
|
||
## Anti-Patterns
|
||
|
||
### Anti-Pattern 1: Per-Instance File Locks for Quota
|
||
|
||
**What people do:** Carry forward the `filelock` pattern into the multi-instance world, using a lock file on a shared volume.
|
||
|
||
**Why it's wrong:** Shared NFS/volume file locking has undefined behavior under Docker Compose networking, requires a shared filesystem mount (kills stateless instances), and is slower than a DB atomic update.
|
||
|
||
**Do this instead:** Atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes` in PostgreSQL. Single round-trip, correct under concurrency, no shared filesystem required.
|
||
|
||
---
|
||
|
||
### Anti-Pattern 2: Streaming All File Traffic Through FastAPI
|
||
|
||
**What people do:** `POST /upload` receives the multipart body into memory, then POSTs it to MinIO from FastAPI.
|
||
|
||
**Why it's wrong:** Doubles memory usage (once in FastAPI, once in MinIO client buffer). Saturates FastAPI worker threads during large uploads. Introduces FastAPI as a bottleneck for byte transfer.
|
||
|
||
**Do this instead:** Two-step presigned URL flow (Pattern 3 above). FastAPI only handles metadata; bytes flow browser → MinIO directly.
|
||
|
||
---
|
||
|
||
### Anti-Pattern 3: Auth in Middleware Instead of Dependencies
|
||
|
||
**What people do:** Write a custom ASGI middleware that reads the `Authorization` header and either passes or rejects requests.
|
||
|
||
**Why it's wrong:** Middleware runs before FastAPI routing. To allow public routes (login, register, health), you must maintain an exclusion list in the middleware. This list inevitably goes stale when new public routes are added. Middleware cannot easily populate `request.state.user` in a way that's type-safe for path operations.
|
||
|
||
**Do this instead:** `Depends(get_current_user)` on each protected router. Optional auth uses `Depends(get_optional_user)` returning `User | None`. Explicit, type-safe, co-located with the route it protects. Confirmed as FastAPI's recommended pattern.
|
||
|
||
---
|
||
|
||
### Anti-Pattern 4: Storing Cloud Credentials Unencrypted (or Relying on DB-Level Encryption Alone)
|
||
|
||
**What people do:** Store OAuth tokens in plaintext DB columns, assuming DB-level TLS or disk encryption is sufficient.
|
||
|
||
**Why it's wrong:** Any user with DB read access (admin, compromised migration, backup leak) can extract all users' cloud tokens. Violates the privacy-first admin model requirement.
|
||
|
||
**Do this instead:** Fernet-encrypt the credential JSON blob in `cloud_service.py` before writing to `cloud_backends.credentials_enc`. The Fernet key lives in `CLOUD_CREDS_KEY` env var only — never in the DB. Admin queries on `cloud_backends` return only `id`, `backend_type`, `display_name`, `is_default` — the `credentials_enc` column is excluded from all admin-facing serializers.

---

### Anti-Pattern 5: One MinIO Bucket Per User

**What people do:** Create a new MinIO bucket for each registered user to enforce isolation.

**Why it's wrong:** MinIO is not designed for millions of buckets. Bucket creation is a management operation. IAM policies per bucket become complex to manage at scale.

**Do this instead:** Single bucket, key prefix isolation: `{user_id}/{document_id}/{filename}`. Enforce prefix scoping in `storage_service.py`: never let a user-supplied key escape their `{user_id}/` prefix. Verify in every `get_object` and `delete_object` call that the resolved key starts with the authenticated user's `{user_id}/` prefix, trailing slash included, so that user `42` cannot match keys belonging to user `421`.
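One way the prefix check could look, as a standard-library-only sketch. The function names and `KeyScopeError` are hypothetical, not taken from `storage_service.py`. Object keys are opaque strings to S3-style stores, so rejecting any non-canonical key (one containing `..` or doubled slashes) is a deliberately conservative choice here.

```python
import posixpath


class KeyScopeError(ValueError):
    """Raised when an object key falls outside the caller's prefix."""


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build `{user_id}/{document_id}/{filename}`, keeping only the final path component."""
    safe_name = posixpath.basename(filename.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        raise KeyScopeError(f"invalid filename: {filename!r}")
    return f"{user_id}/{document_id}/{safe_name}"


def assert_key_in_scope(user_id: str, key: str) -> str:
    """Return the key unchanged if it is canonical and under `{user_id}/`, else raise."""
    if posixpath.normpath(key) != key:
        raise KeyScopeError(f"non-canonical key: {key!r}")
    # The trailing slash matters: user "42" must not match keys under "421/".
    if not key.startswith(f"{user_id}/"):
        raise KeyScopeError(f"key {key!r} is outside prefix {user_id}/")
    return key


key = build_object_key("42", "doc-1", "report.pdf")
assert_key_in_scope("42", key)  # passes: the key is under 42/

for bad_key in ("421/doc-1/report.pdf", "42/../43/doc-1/report.pdf"):
    try:
        assert_key_in_scope("42", bad_key)
    except KeyScopeError as exc:
        print(f"rejected: {exc}")
```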

---

## Integration Points

### External Services

| Service | Integration Pattern | Notes |
|---------|---------------------|-------|
| PostgreSQL | SQLAlchemy 2.0 async (`asyncpg` driver), sessions via `Depends(get_db)` | Use the SQLAlchemy engine's connection pool, not per-request connections |
| MinIO | `minio` Python SDK (sync) wrapped in `asyncio.to_thread()`, or `aiobotocore` for async S3 | Presigned URL generation is lightweight signing work rather than bulk I/O, so `to_thread` is fine |
| OneDrive | Microsoft Graph API via `httpx` async client + OAuth2 PKCE flow | Refresh tokens stored encrypted in `cloud_backends` |
| Google Drive | Google Drive API v3 via `httpx` or `google-auth` library | Same credential model as OneDrive |
| Nextcloud | WebDAV via `httpx` (PUT/GET/DELETE) or `webdavclient3` library | Basic auth or app password; simpler than OAuth |
| PyOTP | TOTP generation/verification (`pyotp.TOTP(secret).verify(code)`) | `verify()` defaults to `valid_window=0` (current 30 s period only); pass `valid_window=1` for a ±1 period (±30 sec) tolerance, which is sufficient |
| `python-jose` or `PyJWT` | JWT encode/decode | Use `HS256` with a 256-bit secret. `python-jose` has broader algorithm support; `PyJWT` is simpler and more actively maintained |
| `cryptography` (Fernet) | Cloud credential encryption/decryption | `Fernet.generate_key()` at setup; store in `CLOUD_CREDS_KEY` env var |
| `passlib[bcrypt]` | Password hashing | bcrypt work factor 12 minimum |
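
The `asyncio.to_thread()` wrapping mentioned in the MinIO row can be sketched as below. `presign_blocking` is a hypothetical stand-in for a synchronous SDK call (the real `minio` client is not used here so the example stays self-contained), and the URL it returns is a placeholder.

```python
import asyncio
import time


def presign_blocking(object_key: str) -> str:
    """Hypothetical stand-in for a synchronous SDK call."""
    time.sleep(0.05)  # simulate blocking work
    return f"https://minio.example.invalid/docs/{object_key}?signature=stub"


async def presign(object_key: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(presign_blocking, object_key)


async def main() -> list:
    # Both calls run concurrently in separate worker threads.
    return await asyncio.gather(
        presign("42/doc-1/a.pdf"),
        presign("42/doc-2/b.pdf"),
    )


urls = asyncio.run(main())
for url in urls:
    print(url)
```

Keeping the wrapper in the service layer means route handlers stay `async def` throughout, and a later switch to a natively async client would only touch the wrapper.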

### Internal Boundaries

| Boundary | Communication | Notes |
|----------|---------------|-------|
| `api/` ↔ `services/` | Direct async function calls | Services never import from `api/`; dependency is one-directional |
| `services/` ↔ `storage/` | `StorageBackend` ABC interface | Services import from `storage/__init__.py` factory only |
| `services/` ↔ `ai/` | Existing `get_provider()` factory, unchanged | AI provider is still re-instantiated per call |
| `deps/` ↔ `services/` | Services can be called from deps (e.g., `quota_service` from the quota dep) | Keep deps thin: prefer passing a DB session to the dep and calling service functions |
| `db/models.py` ↔ everywhere | Import models directly | No repository pattern needed at this scale; SQLAlchemy session + models is sufficient |

---
## Scaling Considerations

| Scale | Architecture | Notes |
|-------|-------------|-------|
| 1–100 users | Single FastAPI instance, `BackgroundTasks`, no queue | This milestone's target; simplest path |
| 100–10k users | Add Celery + Redis for background tasks; add PgBouncer; scale FastAPI to 2–4 instances | Background task queue is the first change needed |
| 10k–100k users | Read replica for PostgreSQL (document listing queries), MinIO multi-node cluster | Document metadata reads dominate; separate read/write paths |
| 100k+ users | Consider separate microservice for classification (GPU workers); CDN in front of MinIO presigned URLs | Classification latency becomes user-facing bottleneck |

---
## Sources

- FastAPI official docs, Security / OAuth2 with JWT: https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/ (HIGH confidence; directly confirmed pattern)
- FastAPI official docs, Advanced Middleware: https://fastapi.tiangolo.com/advanced/middleware/ (HIGH confidence; confirms DI over middleware for auth)
- FastAPI official docs, SQL Databases: https://fastapi.tiangolo.com/tutorial/sql-databases/ (HIGH confidence; session-per-request via `Depends` confirmed)
- MinIO S3 presigned URL pattern: S3-compatible standard, documented in AWS S3 and MinIO docs (HIGH confidence; industry-standard pattern)
- PostgreSQL atomic UPDATE for quota enforcement: standard optimistic concurrency pattern (HIGH confidence)
- Fernet symmetric encryption (`cryptography` library): well-documented Python standard for symmetric key encryption (HIGH confidence)
- Refresh token rotation pattern: IETF OAuth 2.0 Security BCP (RFC 9700 / draft-ietf-oauth-security-topics) (HIGH confidence)

---

*Architecture research for: DocuVault multi-user SaaS document management (FastAPI + Vue 3 brownfield)*

*Researched: 2026-05-21*