docs(05): capture phase 5 context — cloud storage backends

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
curo1305
2026-05-28 17:52:25 +02:00
parent d6f742a3c1
commit 358af367f3
2 changed files with 307 additions and 0 deletions
@@ -0,0 +1,148 @@
# Phase 5: Cloud Storage Backends - Context
**Gathered:** 2026-05-28
**Status:** Ready for planning
<domain>
## Phase Boundary
Users can connect OneDrive, Google Drive, Nextcloud, or a generic WebDAV server as a personal storage backend through the DocuVault web UI. Connected cloud providers appear alongside local MinIO folders in the existing sidebar folder tree. Credentials are encrypted per-user via HKDF. Connection status is visible and manageable from a new "Cloud Storage" tab in SettingsView. Local MinIO storage and all connected cloud backends coexist — no document migration. The `StorageBackend` ABC is extended with four new concrete implementations.
**All 4 providers ship in this phase** — no phased delivery.
</domain>
<decisions>
## Implementation Decisions
### Backend Scope
- **D-01:** All 4 providers (OneDrive/Microsoft Graph, Google Drive v3, Nextcloud, WebDAV) are delivered in this single phase.
- **D-02:** Each provider is a concrete `StorageBackend` subclass in `backend/storage/` (e.g., `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`). The existing ABC's 7 abstract methods define the contract.
### OAuth Flow (Google Drive & OneDrive)
- **D-03:** FastAPI owns the OAuth callback. Flow: user clicks "Connect" in SettingsView → redirected to provider's OAuth consent page → provider redirects to `GET /api/cloud/oauth/callback/{provider}?code=…&state=…` → FastAPI exchanges code for tokens, encrypts credentials, saves to `cloud_connections`, then redirects browser to Vue settings page with `?cloud_connected=google_drive` (or `?cloud_error=…`). The auth code and tokens never land in the frontend.
- **D-04:** OAuth state parameter must encode the authenticated user's ID (signed or encrypted) to prevent CSRF on the callback. Use `secrets.token_urlsafe(32)` + a short-lived server-side state store (Redis or DB) to validate the callback matches the initiating user session.
- **D-05:** Access token refresh is **on-demand and transparent**. When a cloud API call fails with a token-expiry error (HTTP 401 / provider-specific error), the backend catches it, uses the stored refresh token to obtain a new access token, updates `credentials_enc` in the DB, and retries the original call — all within the same request. The user experiences no interruption.
- **D-06:** If the refresh token itself is rejected by the provider (`invalid_grant` or equivalent), the connection status transitions to `REQUIRES_REAUTH` and the request returns an error telling the user to reconnect. No silent failure.
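
The refresh behaviour in D-05 and D-06 can be sketched as a small wrapper. This is a minimal synchronous illustration, not the project's implementation: the exception classes, the `Connection` dataclass, and the `api_call` / `refresh` callables are all hypothetical stand-ins, and the real backend would be async and would persist the re-encrypted `credentials_enc` after a successful refresh.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class TokenExpiredError(Exception):
    """Hypothetical: the provider answered 401 or a token-expiry error."""


class InvalidGrantError(Exception):
    """Hypothetical: the provider rejected the refresh token itself."""


class ReauthRequiredError(Exception):
    """Surfaced to the user: the connection must be reconnected."""


@dataclass
class Connection:
    # Stand-in for a cloud_connections row (real credentials are encrypted).
    access_token: str
    refresh_token: str
    status: str = "ACTIVE"


def call_with_refresh(
    conn: Connection,
    api_call: Callable[[str], T],
    refresh: Callable[[str], str],
) -> T:
    """Run api_call(access_token); on expiry, refresh once and retry (D-05)."""
    try:
        return api_call(conn.access_token)
    except TokenExpiredError:
        try:
            conn.access_token = refresh(conn.refresh_token)
        except InvalidGrantError as exc:
            # D-06: no silent failure; flag the connection and tell the user.
            conn.status = "REQUIRES_REAUTH"
            raise ReauthRequiredError("Reconnect this cloud provider") from exc
        # A real implementation would update credentials_enc in the DB here.
        return api_call(conn.access_token)
```

The retry happens at most once, so a provider that keeps returning expiry errors after a successful refresh propagates the second `TokenExpiredError` instead of looping.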
### Nextcloud & WebDAV Credentials
- **D-07:** The UI presents both auth methods — real account password and app-specific password — with an explanation of trade-offs and a clear recommendation for app password. The backend stores whichever the user provides (both use HTTP Basic Auth). The recommendation text: app passwords can be revoked individually without changing the main account password.
- **D-08:** On save, the backend validates the WebDAV/Nextcloud connection (a lightweight PROPFIND or OPTIONS request) before storing credentials. If validation fails, return an error — never store unverified credentials.
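
The validate-then-store ordering in D-08 can be sketched as follows. The `probe` and `store` callables are hypothetical injection points (a real `probe` would send the PROPFIND or OPTIONS request, and a real `store` would encrypt before writing); only the Basic Auth header format is standard.

```python
import base64
from typing import Callable


def basic_auth_header(username: str, password: str) -> str:
    """Build the HTTP Basic Authorization header value."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def connect_webdav(
    url: str,
    username: str,
    password: str,
    probe: Callable[[str, str], int],
    store: Callable[[str, str, str], None],
) -> None:
    """Store credentials only after the server has accepted them (D-08)."""
    status = probe(url, basic_auth_header(username, password))
    # 207 Multi-Status is the usual PROPFIND success; 200 covers OPTIONS.
    if status not in (200, 207):
        raise ValueError(f"Connection check failed with HTTP {status}")
    store(url, username, password)
```

Because both the account password and an app password travel as the same Basic Auth header, the backend does not need to know which one the user supplied.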
### Storage Selection & Coexistence
- **D-09:** The sidebar folder tree shows local MinIO folders first, then each connected cloud provider as a peer top-level node (e.g., "Google Drive", "My Nextcloud"). Lazy-load one level at a time: when the user expands a cloud node, the backend fetches the first level of that provider's folder tree via the cloud API.
- **D-10:** Upload destination follows the **active folder context**. If the user is viewing a local folder, uploads go to MinIO. If they are viewing a cloud provider folder, uploads go to that cloud provider via FastAPI intermediary (no direct browser-to-cloud upload). The `documents.storage_backend` column already exists to record which backend holds each document.
- **D-11:** Existing MinIO documents stay in MinIO — no migration. Local and cloud documents coexist. `document.storage_backend = "minio"` for existing docs; new cloud docs get `storage_backend = "google_drive"` etc.
- **D-12:** Cloud provider management lives in a new **"Cloud Storage" tab in SettingsView**. The tab shows: all supported providers; connection status badge (`ACTIVE` / `REQUIRES_REAUTH` / `ERROR` / not connected); "Connect" button for unconnected providers; per-connection "Disconnect" button; a "Disconnect all" action.
- **D-13:** Multiple cloud providers can be connected simultaneously (one row per provider in `cloud_connections`). Each provider's tree appears as its own top-level node in the sidebar.
### Cloud Document Upload
- **D-14:** For cloud backends, file bytes go through FastAPI first (`POST /api/documents/upload` detects the target backend from the active folder context), then FastAPI calls the cloud provider API to store them. The presigned-PUT-URL flow (used for MinIO) is **not used** for cloud backends. The `generate_presigned_put_url` method on cloud `StorageBackend` implementations can raise `NotImplementedError` — the upload endpoint detects cloud backends and uses the direct upload path.
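
A rough sketch of the routing described in D-14 is shown below. The two-method ABC is a reduced stand-in for the real 7-method `StorageBackend`; `put_object`, the fake backend classes, and `handle_upload` are hypothetical names, and the real endpoint would be async.

```python
from abc import ABC, abstractmethod
from typing import Dict


class StorageBackend(ABC):
    """Two-method stand-in; the real ABC defines 7 abstract methods."""

    @abstractmethod
    def generate_presigned_put_url(self, object_key: str) -> str: ...

    @abstractmethod
    def put_object(self, object_key: str, data: bytes) -> None: ...  # hypothetical name


class FakeMinIOBackend(StorageBackend):
    def generate_presigned_put_url(self, object_key: str) -> str:
        return f"https://minio.invalid/bucket/{object_key}?signature=fake"

    def put_object(self, object_key: str, data: bytes) -> None:
        raise RuntimeError("the browser uploads via the presigned URL")


class FakeCloudBackend(StorageBackend):
    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def generate_presigned_put_url(self, object_key: str) -> str:
        raise NotImplementedError("cloud uploads are proxied through FastAPI")

    def put_object(self, object_key: str, data: bytes) -> None:
        self.objects[object_key] = data


def handle_upload(backend_name: str, backend: StorageBackend,
                  object_key: str, data: bytes) -> dict:
    """Choose the upload path from the backend name, not by catching errors."""
    if backend_name == "minio":
        url = backend.generate_presigned_put_url(object_key)
        return {"mode": "presigned_put", "url": url}
    backend.put_object(object_key, data)
    return {"mode": "direct", "object_key": object_key}
```

Checking the backend name up front keeps `NotImplementedError` as a guard against programming mistakes rather than part of normal control flow.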
### Cloud Document Retrieval
- **D-15:** Document downloads/previews use the **same `GET /api/documents/{id}/content` proxy endpoint** regardless of storage backend. The endpoint calls `storage_backend.get_object(document.object_key)` and streams the bytes to the browser. The frontend does not know or care which backend holds the file.
- **D-16:** Cloud folder tree browsing is **live API calls** (no DB sync). A **60-second in-memory TTL cache** (keyed by `user_id + provider + folder_path`) prevents redundant calls when the user collapses and re-expands the same node within one minute. The cache lives in FastAPI application state (or `functools.lru_cache`-equivalent with TTL). Not Redis — in-memory is sufficient for a single-user session pattern.
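
One dependency-free way to realise the 60-second cache from D-16 is a dict of expiry timestamps. This is a sketch under assumptions: the `TTLCache` class, `list_folder`, and the `fetch` callable are illustrative names, and the injectable `clock` exists only so the expiry logic can be exercised without sleeping.

```python
import time
from typing import Any, Callable, Dict, Hashable, List, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def list_folder(cache: TTLCache, user_id: int, provider: str, folder_path: str,
                fetch: Callable[[str], List[str]]) -> List[str]:
    """Return cached children if fresh, otherwise call the cloud API once."""
    key = (user_id, provider, folder_path)  # key shape from D-16
    cached = cache.get(key)
    if cached is not None:
        return cached
    children = fetch(folder_path)
    cache.set(key, children)
    return children
```

Including `user_id` in the key matters: without it, one user's folder listing could be served to another user who happens to browse the same provider path.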
### SSRF Prevention
- **D-17:** All outbound HTTP calls to WebDAV/Nextcloud use a URL allowlist: the server URL provided by the user must pass hostname validation (not `localhost`, `127.x`, `169.254.x`, private RFC 1918 ranges, or `::1`). Validation runs at connect-time and before every request. Implemented in a shared `validate_cloud_url()` utility — all WebDAV/Nextcloud backends call it before constructing requests.
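
A possible shape for `validate_cloud_url()` using only the standard library is sketched below. Two things here go beyond what D-17 states and are assumptions: the restriction to `http`/`https` schemes, and resolving hostnames so that a public-looking name pointing at a private address is also rejected. The `resolve` parameter is a hypothetical hook for testing.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlsplit


def _default_resolve(host: str) -> List[str]:
    """Resolve a hostname to every address the system resolver returns."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_forbidden_address(address: str) -> bool:
    """True for loopback, link-local, private, or unspecified addresses."""
    ip = ipaddress.ip_address(address)
    return ip.is_loopback or ip.is_link_local or ip.is_private or ip.is_unspecified


def validate_cloud_url(url: str,
                       resolve: Callable[[str], List[str]] = _default_resolve) -> str:
    """Return url unchanged if it is safe to call; raise ValueError otherwise."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme not in ("http", "https") or not host:
        raise ValueError("URL must use http or https and include a hostname")
    if host.lower() == "localhost" or host.lower().endswith(".localhost"):
        raise ValueError("localhost is not allowed")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        addresses = resolve(host)  # a name: check everything it resolves to
    else:
        addresses = [host]  # a literal IP: no DNS lookup needed
    if not addresses:
        raise ValueError(f"{host} did not resolve to any address")
    for address in addresses:
        if is_forbidden_address(address):
            raise ValueError(f"{host} points at a forbidden address")
    return url
```

Running the check before every request, as D-17 requires, narrows the window in which a DNS record can change between validation and use, although this sketch does not pin the resolved address for the subsequent request.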
### Security Invariants (carry-forward)
- **D-18:** `credentials_enc` is encrypted with HKDF per-user key derivation (`HKDF(CLOUD_CREDS_KEY, salt=user_id_bytes, info=b"cloud-credentials")`). The master key lives in the `CLOUD_CREDS_KEY` env var. Never stored unencrypted. Never returned in any API response.
- **D-19:** Admin API responses for cloud connections return only `provider`, `display_name`, `connected_at`, and `status` (the existing `CloudConnectionOut` Pydantic whitelist pattern from Phase 4).
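
The per-user derivation in D-18 can be illustrated with a standard-library HKDF (extract-then-expand over HMAC). This is a sketch under assumptions: SHA-256 as the hash, a 32-byte output, and the decimal-string encoding of `user_id` are all choices made here for illustration, and the project may well use a vetted library's HKDF instead of hand-rolled code.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with HMAC-SHA256: extract a PRK, then expand it to `length` bytes."""
    if length > 255 * hashlib.sha256().digest_size:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: the salt keys the HMAC, the master key is the message.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks, each tagged with info and a 1-based counter.
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: int) -> bytes:
    """Derive the per-user key used to encrypt credentials_enc."""
    user_id_bytes = str(user_id).encode("ascii")  # assumed encoding of user_id
    return hkdf_sha256(master_key, salt=user_id_bytes, info=b"cloud-credentials")
```

Because the user ID feeds the derivation, a ciphertext copied from one user's row cannot be decrypted with another user's derived key, even though both derive from the same `CLOUD_CREDS_KEY`.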
### Claude's Discretion
- Choice of Python OAuth client library for Google Drive and OneDrive (e.g., `google-auth-oauthlib`, `msal`) — Claude selects based on PyPI availability and Phase 5 open question in STATE.md ("Verify cloud SDK minor versions on PyPI before Phase 5 pinning").
- Choice of WebDAV Python library (e.g., `webdavclient3`, `aiohttp` with manual PROPFIND) — Claude selects based on async compatibility.
- Exact TTL cache implementation (dict + timestamp vs. `cachetools.TTLCache`) — Claude picks the simplest approach with no new dependency if possible.
- OAuth state store implementation (Redis vs. short-lived DB row vs. signed JWT) — Claude selects based on what's already wired in the stack.
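
Whichever state store is chosen for D-04, the contract is roughly the same: issue a random value bound to the initiating user, and accept it exactly once within a short window. The sketch below uses an in-memory dict purely for illustration; the `OAuthStateStore` class and the 10-minute TTL are assumptions, and a Redis key with an expiry or a short-lived DB row would fill the same role.

```python
import secrets
import time
from typing import Callable, Dict, Tuple


class OAuthStateStore:
    """Single-use, short-lived OAuth state values bound to a user and provider."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending: Dict[str, Tuple[float, int, str]] = {}

    def issue(self, user_id: int, provider: str) -> str:
        """Create the state value sent to the provider's consent page."""
        state = secrets.token_urlsafe(32)
        self._pending[state] = (self._clock() + self._ttl, user_id, provider)
        return state

    def consume(self, state: str, user_id: int, provider: str) -> bool:
        """Validate a callback; the state is removed whether or not it matches."""
        entry = self._pending.pop(state, None)
        if entry is None:
            return False
        expires_at, expected_user, expected_provider = entry
        return (
            self._clock() < expires_at
            and expected_user == user_id
            and expected_provider == provider
        )
```

Popping the entry before checking it means a replayed or mismatched callback cannot be retried with the same state value.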
</decisions>
<canonical_refs>
## Canonical References
**Downstream agents MUST read these before planning or implementing.**
### Storage Backend Contract
- `backend/storage/base.py`: `StorageBackend` ABC with 7 abstract methods that all new cloud backends must implement. Note: `generate_presigned_put_url` raises `NotImplementedError` for cloud backends (D-14).
- `backend/storage/__init__.py`: `get_storage_backend()` factory. Phase 5 must extend this to resolve the correct backend from the document's `storage_backend` field and the user's active context.
- `backend/storage/minio_backend.py` — Reference implementation of `StorageBackend` — patterns for `asyncio.to_thread()` wrapping and error handling.
### Data Model
- `backend/db/models.py`: `CloudConnection` model (fields: `id`, `user_id`, `provider`, `display_name`, `credentials_enc`, `status`, `connected_at`). The `cloud_connections` table already exists from the Phase 1 migration. Also see the `Document` model, whose `storage_backend` column records which backend holds each document.
### Requirements
- `.planning/REQUIREMENTS.md` — CLOUD-01 through CLOUD-07 (the 7 cloud storage requirements for this phase).
- `.planning/ROADMAP.md` — Phase 5 goal, success criteria, and phase gates (SSRF test, credential encryption round-trip, admin response never exposing `credentials_enc`, OAuth `invalid_grant` handling).
### Security Protocol
- `CLAUDE.md` §"Key Architectural Rules" — HKDF per-user key derivation pattern, SSRF allowlist requirement, `credentials_enc` never in API responses.
- `CLAUDE.md` §"Security Protocol" — SSRF section: "user-supplied URLs for WebDAV/Nextcloud must pass hostname allowlist".
### AI Provider Pattern (structural analog)
- `backend/ai/base.py`: `AIProvider` ABC. Phase 5 cloud backends mirror this pattern (ABC + factory + per-provider file).
- `backend/ai/__init__.py`: `get_provider()` factory pattern to mirror in the `get_storage_backend()` extension.
### Frontend Patterns
- `frontend/src/stores/`: Pinia store patterns established in Phases 2-4 (auth store, folders store). The cloud connections store follows the same pattern.
- `frontend/src/views/SettingsView.vue` — Existing view to extend with "Cloud Storage" tab.
- `frontend/src/components/FolderTreeItem.vue` (Phase 4) — Lazy-loading tree component to extend for cloud provider nodes.
</canonical_refs>
<code_context>
## Existing Code Insights
### Reusable Assets
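- `backend/storage/base.py` (`StorageBackend` ABC): new cloud backends subclass this directly. Of the ABC's 7 abstract methods, every one other than `generate_presigned_put_url` must be fully implemented; `generate_presigned_put_url` may simply raise `NotImplementedError` (D-14).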
- `backend/storage/minio_backend.py` — Template for `asyncio.to_thread()` pattern, error handling shape, and constructor signature.
- `backend/db/models.py` (`CloudConnection`): the table already exists, so no new migration is needed for the connection model itself. Per D-10, the `documents.storage_backend` column should also already exist; verify this, and add an Alembic migration only if the column turns out to be missing.
- `frontend/src/components/FolderTreeItem.vue` — Existing lazy-load tree item; extend to support cloud provider root nodes with a different icon and live-fetch behavior.
- `frontend/src/views/SettingsView.vue` — Tab-based layout; add "Cloud Storage" as a new tab following the same pattern as existing tabs.
- `GET /api/documents/{id}/content` (Phase 4, Plan 04-05) — PDF proxy endpoint. Phase 5 makes this backend-agnostic by routing through `get_storage_backend()` per document.
### Established Patterns
- **Factory pattern:** `get_storage_backend()` in `backend/storage/__init__.py` mirrors `get_provider()` in `backend/ai/__init__.py`. Cloud backends extend the factory with a `storage_backend` parameter (from the document record or upload context).
- **HKDF encryption:** The per-user HKDF key derivation for cloud credentials is established in CLAUDE.md and already used in the codebase. Reuse the existing derivation utility rather than writing a new one.
- **Pydantic whitelist response models:** `CloudConnectionOut` pattern from Phase 4 — never expose `credentials_enc`. Apply to all new cloud endpoints.
- **`asyncio.to_thread()`:** All sync SDK calls (cloud provider SDKs may be sync) wrapped in `asyncio.to_thread()` — matches MinIOBackend pattern.
- **Audit log:** `write_audit_log()` helper from Phase 4 — call on cloud connect, disconnect, and re-auth events.
- **`get_regular_user` dep:** All cloud connection endpoints use `get_regular_user` (admins are blocked from this surface because cloud credentials are personal, not platform-managed).
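
The factory extension mentioned under "Factory pattern" might look roughly like the registry below. Everything here is a hypothetical sketch: the real `get_storage_backend()` signature, the backend constructors, and the provider keys other than `"minio"` and `"google_drive"` are not specified in this document.

```python
from typing import Dict, Optional, Type


class _Backend:
    """Stand-in for a StorageBackend subclass; holds decrypted credentials."""

    def __init__(self, credentials: Optional[dict] = None) -> None:
        self.credentials = credentials


class MinIOBackend(_Backend):
    pass


class GoogleDriveBackend(_Backend):
    pass


class WebDAVBackend(_Backend):
    pass


# Maps the value stored in documents.storage_backend to a backend class.
_BACKENDS: Dict[str, Type[_Backend]] = {
    "minio": MinIOBackend,
    "google_drive": GoogleDriveBackend,
    "webdav": WebDAVBackend,  # assumed key; not named in this document
}


def get_storage_backend(storage_backend: str = "minio",
                        credentials: Optional[dict] = None) -> _Backend:
    """Resolve a backend instance from a document's storage_backend value."""
    try:
        backend_cls = _BACKENDS[storage_backend]
    except KeyError:
        raise ValueError(f"Unknown storage backend: {storage_backend!r}") from None
    return backend_cls(credentials)
```

Defaulting to `"minio"` keeps existing call sites working unchanged, which fits the requirement that local documents are unaffected by this phase.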
### Integration Points
- `GET/POST /api/cloud/connections` — new endpoint group for connecting, listing, and disconnecting cloud backends.
- `GET /api/cloud/oauth/initiate/{provider}` — redirects user to OAuth consent URL.
- `GET /api/cloud/oauth/callback/{provider}` — FastAPI OAuth callback; exchanges code, saves credentials, redirects to Vue.
- `GET /api/cloud/folders/{provider}/{folder_id}` — lists children of a cloud folder (lazy-load tree).
- Upload endpoint (`POST /api/documents/upload`) — must detect active folder's backend and route accordingly.
- `GET /api/documents/{id}/content` — already proxies bytes; must resolve backend from `document.storage_backend`.
- Sidebar `FolderTreeItem.vue` — add cloud provider root nodes below local folder tree.
</code_context>
<specifics>
## Specific Ideas
- **Sidebar layout:** Local folders shown first under a "My Documents" section header; cloud providers below under a "Cloud Storage" section (or just listed as peer top-level nodes with a cloud icon). The visual separation makes it clear which node is local vs. remote.
- **Multiple providers:** All connected providers appear simultaneously in the sidebar — one node per connection. Disconnecting a provider removes its node from the tree.
- **Nextcloud/WebDAV UX copy:** The connection modal explains: "App password — can be revoked without changing your main password (recommended). Your account password — simpler to set up, but revocation requires changing your entire account password."
- **OAuth callback redirect:** On success, Vue reads `?cloud_connected=google_drive` query param in SettingsView's `onMounted` and shows a transient success toast. On error, reads `?cloud_error=…` and shows an error banner.
- **`REQUIRES_REAUTH` prompt:** When a connection has status `REQUIRES_REAUTH`, the SettingsView Cloud Storage tab shows a yellow badge and a "Reconnect" button that re-initiates the OAuth flow.
</specifics>
<deferred>
## Deferred Ideas
- **Document migration between backends** — user-initiated move of existing MinIO docs to a cloud provider. Out of scope for Phase 5; no migration is performed.
- **Cloud-native resumable upload URLs** (provider-specific presigned upload sessions) — skipped in favor of FastAPI intermediary (simpler). Can be added as a performance optimization in a future phase.
- **Shared cloud storage (team/organization)** — multiple users sharing one cloud backend. Out of scope; `cloud_connections` is per-user.
- **Cloud folder sync / offline cache** — syncing cloud folder trees to DB for offline browsing. Out of scope; live API + TTL cache is sufficient.
- **Email notifications on REQUIRES_REAUTH** — out of scope for Phase 5; status is visible in SettingsView.
</deferred>
---
*Phase: 5-Cloud Storage Backends*
*Context gathered: 2026-05-28*
@@ -0,0 +1,159 @@
# Phase 5: Cloud Storage Backends - Discussion Log
> **Audit trail only.** Do not use as input to planning, research, or execution agents.
> Decisions are captured in CONTEXT.md — this log preserves the alternatives considered.
**Date:** 2026-05-28
**Phase:** 5-cloud-storage-backends
**Areas discussed:** Backend scope, OAuth flow & token refresh, Storage selection UX, Cloud document retrieval
---
## Backend Scope
| Option | Description | Selected |
|--------|-------------|----------|
| All 4 in one phase | OneDrive, Google Drive, Nextcloud, WebDAV all in Phase 5 | ✓ |
| WebDAV + Nextcloud first | Ship simpler (credential-based) backends first; OAuth providers in Phase 6 | |
| Just one provider as MVP | One end-to-end provider to prove the pattern, others follow | |
**User's choice:** All 4 in one phase
**Notes:** User wants the full feature set shipped together.
---
## OAuth Flow & Token Refresh
### OAuth callback architecture
| Option | Description | Selected |
|--------|-------------|----------|
| FastAPI handles it, then redirects to Vue | Backend exchanges code for tokens, saves encrypted creds, redirects browser to Vue with success/error query param | ✓ |
| Vue intercepts the callback | Frontend catches redirect, POSTs code to FastAPI — auth code briefly in frontend | |
| You decide | Claude chooses | |
**User's choice:** FastAPI handles it, then redirects to Vue
**Notes:** Keeps tokens entirely server-side; consistent with existing auth architecture.
### Token refresh strategy
| Option | Description | Selected |
|--------|-------------|----------|
| On-demand refresh | Catch 401, refresh silently, retry — transparent to user | ✓ (via Other) |
| Proactive Celery beat refresh | Background task refreshes before expiry | |
| Fail and prompt re-auth | Mark REQUIRES_REAUTH on expiry, no silent refresh | |
**User's choice:** Automatic refresh (on-demand, transparent). Also explicitly requested disconnect per-connection + "Disconnect all" option.
**Notes:** Falls back to REQUIRES_REAUTH only on `invalid_grant` (refresh token itself revoked).
### Nextcloud/WebDAV credential method
| Option | Description | Selected |
|--------|-------------|----------|
| URL + username + app password | App passwords revocable individually — recommended | ✓ (via Other) |
| URL + username + real password | Simpler; revocation requires changing entire account password | |
| You decide | Claude picks | |
**User's choice:** Show both options in the UI with explanations and trade-offs; recommend app passwords. Backend stores whichever the user picks.
**Notes:** Both use HTTP Basic Auth at the protocol level. UI copy explains the difference.
---
## Storage Selection UX
### Sidebar cloud folder tree depth
| Option | Description | Selected |
|--------|-------------|----------|
| Lazy-load one level at a time | Expand a node → fetch its children from cloud API | ✓ |
| Show only root of each provider | Single node per provider, click opens full-screen cloud browser | |
| Pre-fetch 2 levels deep on connect | Eager fetch on connect; faster browsing, stale quickly | |
**User's choice:** Lazy-load one level at a time
**Notes:** Cloud providers appear as top-level sidebar nodes alongside local MinIO folders, matching a Windows Explorer / Nextcloud-style file manager layout.
### Upload destination
| Option | Description | Selected |
|--------|-------------|----------|
| Follows the active folder | Upload goes to the backend of the folder the user is viewing | ✓ |
| Default backend in settings | Global setting overridden per-upload | |
| Per-upload choice at upload time | Dropdown on every upload dialog | |
**User's choice:** Follows the active folder (context-driven)
**Notes:** No explicit setting needed — the active folder's backend determines the destination.
### Existing document migration
| Option | Description | Selected |
|--------|-------------|----------|
| Stay in MinIO — no migration | Existing docs unaffected; local and cloud coexist | ✓ |
| Optional migration | Post-connect prompt to migrate existing docs | |
| You decide | | |
**User's choice:** Stay in MinIO — no migration
**Notes:** CLOUD-03 satisfied by coexistence without migration.
### Cloud provider management location
| Option | Description | Selected |
|--------|-------------|----------|
| Existing SettingsView, new "Cloud Storage" tab | Add tab to SettingsView alongside existing tabs | ✓ |
| Dedicated /cloud-storage route | New full-page view | |
| Sidebar action on cloud provider node | Gear icon → management popover | |
**User's choice:** New "Cloud Storage" tab in SettingsView
---
## Cloud Document Retrieval
### Upload path for cloud backends
| Option | Description | Selected |
|--------|-------------|----------|
| FastAPI intermediary | File bytes go through FastAPI → cloud provider API | ✓ |
| Cloud-native resumable upload URLs | Provider-specific upload session URL generated and sent to browser | |
| You decide | | |
**User's choice:** FastAPI intermediary for cloud uploads
**Notes:** Presigned-PUT-URL flow stays MinIO-only. Cloud backends' `generate_presigned_put_url` raises `NotImplementedError`.
### Download/preview path
| Option | Description | Selected |
|--------|-------------|----------|
| Same /api/documents/{id}/content proxy | Backend resolves StorageBackend from document.storage_backend | ✓ |
| Separate /api/documents/{id}/cloud-content | Parallel endpoint for cloud docs | |
| Temporary cloud provider URL (redirect) | Return provider's signed download URL to browser — exposes cloud URLs | |
**User's choice:** Same proxy endpoint
**Notes:** Frontend remains storage-backend-agnostic.
### Cloud folder tree freshness
| Option | Description | Selected |
|--------|-------------|----------|
| Live calls + 60s in-memory TTL cache | Per-folder cache keyed by user+provider+path; 60s TTL | ✓ |
| Live calls only, no cache | Always fresh; no protection against rapid UI interactions | |
| You decide | | |
**User's choice:** Live calls + 60s in-memory TTL cache
**Notes:** User raised valid concern about cloud API rate limits and potential throttling. Claude explained: human-paced browsing is well within all provider limits (Google Drive: 12k req/100s per user); TTL cache protects against collapse/re-expand patterns. No DB sync needed.
---
## Claude's Discretion
- Python OAuth library choice (Google: `google-auth-oauthlib`; Microsoft: `msal`)
- WebDAV Python library choice (`webdavclient3` vs. `aiohttp` with manual PROPFIND)
- TTL cache implementation (`cachetools.TTLCache` vs. dict + timestamp)
- OAuth state store implementation (Redis / short-lived DB row / signed JWT)
## Deferred Ideas
- Document migration between backends (local → cloud)
- Cloud-native resumable upload URLs (performance optimization)
- Shared/team cloud storage
- Cloud folder tree DB sync / offline cache
- Email notifications on REQUIRES_REAUTH