d13801538d
B1: Mark RESEARCH.md Open Questions as (RESOLVED) with decision text for all 3
B2: Backends now stateless — raise CloudConnectionError(reason=) only; API layer
in cloud.py owns token refresh + DB update via _call_cloud_op helper
B3: Add Task 3 to Plan 05 — cloud connection + object cleanup on account deletion (SEC-09)
B4: Add frontend_url setting to Plan 01 Task 1; Plan 05 uses settings.frontend_url
for OAuth callback redirects
W1: ROADMAP.md Phase 5 now correctly labels Plans 03+04 as Wave 3 (not Wave 2)
W2: Plan 06 invalid_grant test now asserts both 503 HTTP response AND DB REQUIRES_REAUTH
W3: Plan 06 Task 2 split into unit tests (4, cloud_utils.py) and integration tests (11, HTTP)
W4: Plan 07 adds Vitest tests for cloudConnections store (4 tests) and SettingsCloudTab
mount test (2 tests) per CLAUDE.md testing protocol
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
991 lines
59 KiB
Markdown
# Phase 5: Cloud Storage Backends — Research

**Researched:** 2026-05-28
**Domain:** OAuth2 cloud provider integration, WebDAV/Nextcloud, credential encryption, SSRF prevention, StorageBackend ABC extension
**Confidence:** HIGH (all package versions verified on PyPI; patterns verified against official docs and codebase)

---

<user_constraints>
|
||
## User Constraints (from CONTEXT.md)
|
||
|
||
### Locked Decisions
|
||
|
||
- **D-01:** All 4 providers (OneDrive/Microsoft Graph, Google Drive v3, Nextcloud, WebDAV) delivered in this single phase.
|
||
- **D-02:** Each provider is a concrete `StorageBackend` subclass in `backend/storage/` (e.g., `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`).
|
||
- **D-03:** FastAPI owns the OAuth callback. Flow: user clicks "Connect" → provider OAuth consent page → `GET /api/cloud/oauth/callback/{provider}?code=…&state=…` → FastAPI exchanges code, encrypts credentials, saves to `cloud_connections`, then redirects browser to Vue settings page with `?cloud_connected=google_drive` (or `?cloud_error=…`). Auth code and tokens never land in the frontend.
|
||
- **D-04:** OAuth state parameter encodes the authenticated user's ID (signed or encrypted) using `secrets.token_urlsafe(32)` + a short-lived server-side state store (Redis or DB) to validate the callback matches the initiating user session.
|
||
- **D-05:** Access token refresh is on-demand and transparent. When a cloud API call fails with token-expiry (HTTP 401), the backend catches it, uses the stored refresh token, updates `credentials_enc` in DB, and retries the original call within the same request.
|
||
- **D-06:** If the refresh token is rejected by the provider (`invalid_grant`), the connection status transitions to `REQUIRES_REAUTH` and the request returns an error telling the user to reconnect. No silent failure.
|
||
- **D-07:** UI presents both auth methods for Nextcloud/WebDAV (real account password and app-specific password) with clear recommendation for app password.
|
||
- **D-08:** On save, backend validates the WebDAV/Nextcloud connection (lightweight PROPFIND or OPTIONS request) before storing credentials. If validation fails, return an error — never store unverified credentials.
|
||
- **D-09:** Sidebar shows local MinIO folders first, then each connected cloud provider as a peer top-level node. Lazy-load one level at a time.
|
||
- **D-10:** Upload destination follows the active folder context. Cloud uploads go through FastAPI intermediary — no direct browser-to-cloud.
|
||
- **D-11:** Existing MinIO documents stay in MinIO — no migration. `storage_backend="minio"` for existing docs; `"google_drive"`, `"onedrive"`, etc. for new cloud docs.
|
||
- **D-12:** Cloud provider management lives in a new "Cloud Storage" tab in SettingsView.
|
||
- **D-13:** Multiple cloud providers can be connected simultaneously (one row per provider in `cloud_connections`).
|
||
- **D-14:** Cloud backends: `generate_presigned_put_url` raises `NotImplementedError`. Upload endpoint detects cloud backends and uses direct upload path.
|
||
- **D-15:** Downloads/previews use the same `GET /api/documents/{id}/content` proxy endpoint regardless of backend. Calls `storage_backend.get_object(document.object_key)` and streams bytes to browser.
|
||
- **D-16:** Cloud folder tree browsing is live API calls with a 60-second in-memory TTL cache (keyed by `user_id + provider + folder_path`). Not Redis — in-memory is sufficient.
|
||
- **D-17:** All outbound HTTP to WebDAV/Nextcloud validates URL against SSRF blocklist (localhost, 127.x, 169.254.x, RFC 1918, ::1). Validation in a shared `validate_cloud_url()` utility called before every request.
|
||
- **D-18:** `credentials_enc` encrypted with `HKDF(CLOUD_CREDS_KEY, salt=user_id_bytes, info=b"cloud-credentials")`. Master key in `CLOUD_CREDS_KEY` env var. Never stored unencrypted. Never returned in any API response.
|
||
- **D-19:** Admin API responses for cloud connections return only `provider, display_name, connected_at, status` (CloudConnectionOut Pydantic whitelist pattern from Phase 4).
|
||
|
||
### Claude's Discretion
|
||
|
||
- Choice of Python OAuth client library for Google Drive and OneDrive (e.g., `google-auth-oauthlib`, `msal`).
|
||
- Choice of WebDAV Python library (e.g., `webdavclient3`, `aiohttp` with manual PROPFIND).
|
||
- Exact TTL cache implementation (dict + timestamp vs. `cachetools.TTLCache`).
|
||
- OAuth state store implementation (Redis vs. short-lived DB row vs. signed JWT).
|
||
|
||
### Deferred Ideas (OUT OF SCOPE)
|
||
|
||
- Document migration between backends (user-initiated move of MinIO docs to cloud).
|
||
- Cloud-native resumable upload URLs (provider-specific presigned upload sessions).
|
||
- Shared cloud storage (team/organization).
|
||
- Cloud folder sync / offline cache.
|
||
- Email notifications on REQUIRES_REAUTH.
|
||
</user_constraints>
|
||
|
||
<phase_requirements>
|
||
## Phase Requirements
|
||
|
||
| ID | Description | Research Support |
|----|-------------|------------------|
| CLOUD-01 | User can connect OneDrive (Microsoft Graph), Google Drive (v3 API), Nextcloud, or generic WebDAV as a personal storage backend | MSAL + google-auth-oauthlib OAuth2 flows; webdavclient3 for WebDAV/Nextcloud |
| CLOUD-02 | Cloud OAuth credentials encrypted using HKDF per-user key derivation (`HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`); master key in `CLOUD_CREDS_KEY` env var | `cryptography` library HKDF + Fernet pattern documented |
| CLOUD-03 | Local MinIO storage and connected cloud backends coexist; user can select their default storage destination | `documents.storage_backend` column already in schema; `users.default_storage_backend` column already present |
| CLOUD-04 | Each cloud connection displays status: `ACTIVE`, `REQUIRES_REAUTH`, or `ERROR` | `CloudConnection.status` column already in schema |
| CLOUD-05 | On OAuth revocation (`invalid_grant`), connection status transitions to `REQUIRES_REAUTH` — surfaced to user, not retried silently | On-demand token refresh pattern with `invalid_grant` catch documented |
| CLOUD-06 | User can disconnect a cloud backend; credentials are permanently deleted from the DB | `DELETE /api/cloud/connections/{id}` with ownership check |
| CLOUD-07 | Storage backend abstracted via `StorageBackend` ABC + factory in `storage/` module (mirrors existing `ai/` provider pattern) | ABC already exists with 7 abstract methods; factory already in `storage/__init__.py` |
|
||
</phase_requirements>
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
Phase 5 extends DocuVault's existing storage abstraction with four cloud provider backends. The infrastructure is largely pre-built: the `StorageBackend` ABC with 7 abstract methods already exists (`backend/storage/base.py`), the `cloud_connections` table with all required columns (`id`, `user_id`, `provider`, `credentials_enc`, `status`, `connected_at`) was created in migration 0001, the `documents.storage_backend` column already exists, and `users.default_storage_backend` already exists. No new Alembic migration is needed for the data model.
|
||
|
||
The three main implementation challenges are: (1) the OAuth2 callback flow where FastAPI owns both the initiation and code-exchange, (2) per-user HKDF credential encryption using the `cryptography` library (which is **not currently in `requirements.txt`** and must be added), and (3) SSRF prevention for user-supplied WebDAV/Nextcloud URLs using Python's built-in `ipaddress` module. Redis is already wired on `app.state.redis` and is the correct choice for OAuth state storage (TTL-backed, eliminates race conditions in multi-instance deployments, already proven pattern in auth.py for TOTP replay prevention).
|
||
|
||
The WebDAV/Nextcloud backends should use `webdavclient3` wrapped in `asyncio.to_thread()` (matching the MinIOBackend pattern) rather than an async-native library — `webdavclient3` is the most mature option (8+ years old, actively maintained) and its sync API is well-documented. Google Drive uses `google-api-python-client` + `google-auth-oauthlib`; OneDrive uses `msal` with the authorization code flow. Both sync SDKs wrap in `asyncio.to_thread()`.
|
||
|
||
**Primary recommendation:** Add `cryptography>=41.0.0`, `google-auth-oauthlib>=1.3.1`, `google-api-python-client>=2.196.0`, `msal>=1.36.0`, `webdavclient3>=3.14.7`, and `cachetools>=5.3.0` to `requirements.txt`. Implement OAuth state via Redis TTL (30-minute expiry). Use `cachetools.TTLCache` (version 6.2.6 verified on PyPI) for the 60-second folder listing cache. Use Python's built-in `ipaddress` module for SSRF URL validation; no additional library is needed.
|
||
|
||
---
|
||
|
||
## Architectural Responsibility Map
|
||
|
||
| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| OAuth2 initiation (redirect URL generation) | API / Backend | — | Secrets (client_id, client_secret) must never reach the browser |
| OAuth2 callback code exchange | API / Backend | — | Auth code + client_secret exchange is a server-to-server operation (D-03) |
| OAuth state CSRF validation | API / Backend (Redis) | — | State token must be stored server-side and expire after use (D-04) |
| Credential encryption/decryption | API / Backend | — | HKDF master key lives in env var; decryption happens at API layer only |
| Cloud file upload | API / Backend | Cloud Provider API | Bytes pass through FastAPI intermediary — no direct browser-to-cloud (D-10) |
| Cloud file download/preview | API / Backend | Cloud Provider API | Same proxy endpoint as MinIO (D-15) |
| Cloud folder tree listing | API / Backend | Cloud Provider API | Lazy-load, TTL-cached in FastAPI app state (D-16) |
| SSRF validation | API / Backend | — | Must run before every outbound HTTP call; not frontend-accessible (D-17) |
| Connection status display | Frontend / Client | — | UI reads `status` field from API; no direct cloud calls from browser |
| Cloud Storage settings tab | Frontend / Client | — | New tab in SettingsView; reads/writes via `/api/cloud/connections` |
| On-demand token refresh | API / Backend | — | Transparent to user; handled within the request lifecycle (D-05) |
| Default storage backend selection | API / Backend + DB | Frontend / Client | `users.default_storage_backend` column; UI reads/writes via settings endpoint |
|
||
|
||
---
|
||
|
||
## Standard Stack
|
||
|
||
### Core (new additions to requirements.txt)
|
||
|
||
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `cryptography` | 48.0.0 | HKDF key derivation + Fernet encryption for `credentials_enc` | The only Python library with official HKDF + Fernet in one package; already referenced in CLAUDE.md |
| `google-auth-oauthlib` | 1.3.1 | Google OAuth2 authorization code flow; `Flow` class manages URL generation and code exchange | Official Google library; listed in Google's own Python quickstart |
| `google-api-python-client` | 2.196.0 | Google Drive v3 API (files.get, files.create, files.delete, files.list) | Official Google library; required alongside google-auth-oauthlib for Drive operations |
| `msal` | 1.36.0 | Microsoft Authentication Library — authorization code flow for OneDrive/Microsoft Graph | Official Microsoft library; only sanctioned way to obtain Microsoft Graph tokens |
| `webdavclient3` | 3.14.7 | WebDAV operations (PROPFIND, upload, download, delete) for both Nextcloud and generic WebDAV | Mature (8 years), actively maintained, supports Nextcloud and all standard WebDAV servers |
| `cachetools` | 6.2.6 | `TTLCache` for 60-second folder listing cache in FastAPI app state (D-16) | Standard cache library; pure Python; no new infrastructure dependency |
|
||
|
||
[VERIFIED: PyPI]: all versions confirmed via `pip download` against the PyPI registry.
|
||
|
||
### Already in requirements.txt (relevant to Phase 5)
|
||
|
||
| Library | Current Version Spec | Phase 5 Use |
|---------|---------------------|-------------|
| `httpx` | >=0.27 | Microsoft Graph REST calls (aiohttp alternative); already used for HIBP |
| `redis` | >=4.6.0 | OAuth state storage (TTL-keyed state tokens, already on `app.state.redis`) |
| `aioredis` | via `redis[asyncio]` | Already wired in `main.py` lifespan |
| `pydantic` | >=2.0 | Request/response models for new cloud endpoints |
|
||
|
||
### Alternatives Considered
|
||
|
||
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `webdavclient3` | `aiohttp` + raw PROPFIND XML | webdavclient3 handles XML parsing, redirect following, and auth headers; raw aiohttp requires implementing RFC 4918 manually |
| `webdavclient3` | `aiodav` / `aiowebdav2` | These async WebDAV libs are very new (< 2 years old, low download counts); webdavclient3 wrapped in `asyncio.to_thread()` matches the MinIOBackend pattern and is safer |
| `msal` (for OneDrive) | `requests-oauthlib` + raw Graph calls | MSAL handles token refresh, token cache, and `invalid_grant` detection natively |
| `cachetools.TTLCache` | `dict` + timestamp | TTLCache has automatic expiry and LRU eviction; manual dict+timestamp requires cleanup logic; both work, TTLCache is cleaner |
| Redis for OAuth state | Signed JWT state | Redis is already wired; TTL-keyed Redis entries are the proven pattern (auth.py TOTP replay prevention). Signed JWT state is viable but requires HMAC secret management for state-only tokens |
|
||
|
||
**Installation:**
|
||
```bash
# Add to backend/requirements.txt
cryptography>=41.0.0
google-auth-oauthlib>=1.3.1
google-api-python-client>=2.196.0
msal>=1.36.0
webdavclient3>=3.14.7
cachetools>=5.3.0
```
|
||
|
||
**Version verification:** Confirmed against PyPI via `pip download`:
|
||
- `cryptography-48.0.0` — `[VERIFIED: PyPI]`
|
||
- `google_auth_oauthlib-1.3.1` — `[VERIFIED: PyPI]`
|
||
- `google_api_python_client-2.196.0` — `[VERIFIED: PyPI]`
|
||
- `msal-1.36.0` — `[VERIFIED: PyPI]`
|
||
- `webdavclient3-3.14.7` — `[VERIFIED: PyPI]`
|
||
- `cachetools-6.2.6` — `[VERIFIED: PyPI]`
|
||
|
||
---
|
||
|
||
## Package Legitimacy Audit
|
||
|
||
All packages verified via slopcheck 0.6.1 (run 2026-05-28):
|
||
|
||
| Package | Registry | Age | Downloads | Source Repo | slopcheck | Disposition |
|---------|----------|-----|-----------|-------------|-----------|-------------|
| `cryptography` | PyPI | 12+ yrs | 100M+/wk | github.com/pyca/cryptography | [OK] | Approved |
| `google-auth-oauthlib` | PyPI | 7+ yrs | 50M+/wk | github.com/googleapis/google-auth-library-python-oauthlib | [OK] | Approved |
| `google-api-python-client` | PyPI | 10+ yrs | 30M+/wk | github.com/googleapis/google-api-python-client | [OK] — note: "Name ends with '-client' — looks like LLM bait but package is established" | Approved |
| `msal` | PyPI | 6+ yrs | 10M+/wk | github.com/AzureAD/microsoft-authentication-library-for-python | [OK] | Approved |
| `webdavclient3` | PyPI | 8+ yrs | 200K+/wk | github.com/CloudPolis/webdavclient3 | [OK] | Approved |
| `cachetools` | PyPI | 10+ yrs | 80M+/wk | github.com/tkem/cachetools | [OK] | Approved |
|
||
|
||
**Packages removed due to slopcheck [SLOP] verdict:** none
|
||
**Packages flagged as suspicious [SUS]:** none
|
||
|
||
---
|
||
|
||
## Architecture Patterns
|
||
|
||
### System Architecture Diagram
|
||
|
||
```
Browser (Vue 3)
    │
    │  Click "Connect Google Drive"
    ▼
[GET /api/cloud/oauth/initiate/google_drive]
    │  1. Generate state_token = secrets.token_urlsafe(32)
    │  2. Store Redis: oauth_state:{state_token} = user_id (TTL 30 min)
    │  3. Build authorization_url via google_auth_oauthlib.Flow
    │  4. HTTP 302 redirect → Google OAuth consent page
    ▼
Google OAuth Consent Page (browser)
    │  User approves
    │  Google redirects to:
    ▼
[GET /api/cloud/oauth/callback/google_drive?code=...&state=...]
    │  1. Validate state → lookup Redis oauth_state:{state} → get user_id
    │  2. Delete Redis key (prevent replay)
    │  3. Exchange code → tokens via flow.fetch_token()
    │  4. Serialize credentials (access_token, refresh_token, expiry)
    │  5. Encrypt with HKDF-derived per-user Fernet key
    │  6. Save/upsert cloud_connections row (user_id, provider, credentials_enc, status=ACTIVE)
    │  7. HTTP 302 redirect → Vue /settings?cloud_connected=google_drive
    ▼
Vue SettingsView (onMounted)
    │  Reads ?cloud_connected=google_drive
    │  Shows success toast
    ▼
[GET /api/cloud/connections]
    │  Lists all cloud connections for current user
    │  Returns CloudConnectionOut (no credentials_enc)
    ▼
Browser renders Cloud Storage tab with connection status badges

─────── Document Upload to Cloud Folder ───────

Browser (Vue 3)
    │  User is viewing Google Drive folder node
    │  Drops file
    ▼
[POST /api/documents/upload]
    │  active folder context = cloud folder (provider=google_drive, folder_id=...)
    │  1. Load CloudConnection for user + provider
    │  2. Decrypt credentials_enc → Fernet key → credentials dict
    │  3. Check token expiry → if expired, refresh transparently (D-05)
    │  4. Call google_drive_backend.put_object(user_id, doc_id, bytes, ext, ct)
    │     └── asyncio.to_thread → drive.files().create(...)
    │  5. Save Document(storage_backend="google_drive", object_key=drive_file_id)
    ▼
Browser shows upload progress (same UploadProgress component)

─────── Document Download from Cloud ───────

[GET /api/documents/{id}/content]
    │  1. Load Document → storage_backend = "google_drive"
    │  2. get_storage_backend("google_drive", user_id, session) → GoogleDriveBackend
    │  3. backend.get_object(object_key) → bytes
    │  4. StreamingResponse to browser
    ▼
Browser renders PDF in existing DocumentPreviewModal

─────── WebDAV/Nextcloud Connection ───────

Browser
    │  User submits server_url + username + password (or app password)
    ▼
[POST /api/cloud/connections/webdav]
    │  1. validate_cloud_url(server_url) → SSRF check (ipaddress module)
    │  2. Test connection: PROPFIND server_url (lightweight)
    │  3. If success: encrypt credentials → save cloud_connections
    │  4. If fail: 422 with error message (D-08)
    ▼
Browser shows ACTIVE status badge
```
|
||
|
||
### Recommended Project Structure
|
||
|
||
```
backend/storage/
├── base.py                  # existing StorageBackend ABC (7 abstract methods)
├── __init__.py              # extend get_storage_backend() factory
├── minio_backend.py         # existing reference implementation
├── google_drive_backend.py  # new: Google Drive v3
├── onedrive_backend.py      # new: Microsoft Graph / OneDrive
├── nextcloud_backend.py     # new: Nextcloud (WebDAV + status endpoint)
├── webdav_backend.py        # new: generic WebDAV
└── cloud_utils.py           # new: validate_cloud_url(), encrypt_credentials(), decrypt_credentials()

backend/api/
└── cloud.py                 # new: all /api/cloud/* endpoints

backend/services/
└── cloud_cache.py           # new: TTLCache singleton for folder listings

backend/tests/
└── test_cloud.py            # new: all Phase 5 tests
```
|
||
|
||
### Pattern 1: StorageBackend ABC Contract (7 methods)
|
||
|
||
The existing ABC requires all 7 methods. Cloud backends raise `NotImplementedError` for `generate_presigned_put_url` per D-14:
|
||
|
||
```python
# Source: backend/storage/base.py (verified in codebase)
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    @abstractmethod
    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str: ...

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def delete_object(self, object_key: str) -> None: ...

    @abstractmethod
    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str: ...

    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str: ...

    @abstractmethod
    async def stat_object(self, object_key: str) -> int: ...
```
|
||
|
||
Cloud backends implement all 7. For `generate_presigned_put_url` and `presigned_get_url`, cloud backends raise `NotImplementedError` — the upload endpoint detects cloud backends and uses the direct path (D-14). For `stat_object`, cloud backends return file size from the provider's metadata response.
|
||
|
||
The `object_key` for cloud backends is the **provider's native file ID** (e.g., Google Drive file ID, OneDrive item ID, WebDAV path). The STORE-02 key schema (`{user_id}/{document_id}/{uuid4()}{ext}`) applies only to MinIO.
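
The contract described above can be sketched in miniature. This is an illustration only: `StorageBackendSketch` is a trimmed stand-in with three of the seven methods, and `FakeCloudBackend` and `supports_presigned_upload` are hypothetical names that do not exist in the codebase. The fake keeps bytes in a dict and returns an opaque ID as `object_key`, mirroring how a real provider would return its native file ID.

```python
import asyncio
from abc import ABC, abstractmethod


class StorageBackendSketch(ABC):
    """Trimmed stand-in for StorageBackend (3 of the 7 methods)."""

    @abstractmethod
    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str: ...

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str: ...


class FakeCloudBackend(StorageBackendSketch):
    """Hypothetical cloud backend that keeps files in a dict."""

    def __init__(self):
        self._files = {}

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        # A real provider would return its native file ID here.
        object_key = f"fake-id-{len(self._files) + 1}"
        self._files[object_key] = file_bytes
        return object_key

    async def get_object(self, object_key: str) -> bytes:
        return self._files[object_key]

    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str:
        # D-14: cloud backends have no presigned upload path.
        raise NotImplementedError("cloud backends upload through the API")


async def supports_presigned_upload(backend: StorageBackendSketch) -> bool:
    """Probe the backend; an upload endpoint could branch on the result."""
    try:
        await backend.generate_presigned_put_url("probe")
    except NotImplementedError:
        return False
    return True


async def _demo():
    backend = FakeCloudBackend()
    key = await backend.put_object("user-1", "doc-1", b"hello", ".txt", "text/plain")
    return await backend.get_object(key), await supports_presigned_upload(backend)


stored_bytes, has_presigned = asyncio.run(_demo())
```

Probing by exception is only one option; branching on `document.storage_backend != "minio"` would avoid the extra call.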
|
||
|
||
### Pattern 2: HKDF + Fernet Credential Encryption
|
||
|
||
```python
# Source: cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/
# [VERIFIED: CITED: cryptography.io]
import base64
import json

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def _derive_fernet_key(master_key: bytes, user_id: str) -> Fernet:
    """Derive a per-user Fernet key using HKDF-SHA256.

    master_key = CLOUD_CREDS_KEY env var as bytes
    salt = user_id bytes (deterministic per user — we need same key on decrypt)
    info = b"cloud-credentials" (domain separation)
    """
    hkdf = HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=user_id.encode("utf-8"),  # deterministic salt = user_id
        info=b"cloud-credentials",
    )
    raw_key = hkdf.derive(master_key)
    fernet_key = base64.urlsafe_b64encode(raw_key)
    return Fernet(fernet_key)


def encrypt_credentials(master_key: bytes, user_id: str, credentials: dict) -> str:
    """Encrypt credentials dict to base64 Fernet token string."""
    f = _derive_fernet_key(master_key, user_id)
    plaintext = json.dumps(credentials).encode("utf-8")
    return f.encrypt(plaintext).decode("utf-8")


def decrypt_credentials(master_key: bytes, user_id: str, credentials_enc: str) -> dict:
    """Decrypt credentials_enc back to dict."""
    f = _derive_fernet_key(master_key, user_id)
    plaintext = f.decrypt(credentials_enc.encode("utf-8"))
    return json.loads(plaintext)
```
|
||
|
||
**Critical note:** HKDF is **not** reusable — a new `HKDF` instance must be created for each derivation call. The `cryptography` library raises `AlreadyFinalized` if `.derive()` is called twice on the same instance. The `_derive_fernet_key` function must create a fresh `HKDF` instance each call.
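
To make the per-user derivation concrete, here is a standard-library sketch of what HKDF-SHA256 computes (RFC 5869 extract-then-expand). It is for illustration and unit-test reasoning only; the production path should keep using the `cryptography` HKDF class shown above. `hkdf_sha256` and `derive_user_key` are hypothetical helper names, and the master key is a dummy value.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """RFC 5869 HKDF with SHA-256: extract, then expand."""
    # Extract: PRK = HMAC(salt, input key material)
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || counter)
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Return a urlsafe-base64 32-byte key, the format Fernet expects."""
    raw = hkdf_sha256(master_key, salt=user_id.encode("utf-8"), info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)


master = b"dummy-master-key-for-illustration"
key_a_first = derive_user_key(master, "user-a")
key_a_again = derive_user_key(master, "user-a")
key_b = derive_user_key(master, "user-b")
```

Because the salt is the user ID, the same user always derives the same key (so decryption works later), while two users never share a key even though they share the master key.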
|
||
|
||
### Pattern 3: Google Drive OAuth2 Flow via google-auth-oauthlib
|
||
|
||
```python
# Source: googleapis.dev/python/google-auth-oauthlib/latest (VERIFIED: official docs)
from google_auth_oauthlib.flow import Flow

SCOPES = ["https://www.googleapis.com/auth/drive.file"]
client_config = {
    "web": {
        "client_id": settings.google_client_id,
        "client_secret": settings.google_client_secret,
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
    }
}
redirect_uri = f"{settings.backend_url}/api/cloud/oauth/callback/google_drive"

# At initiation:
flow = Flow.from_client_config(client_config, scopes=SCOPES)
flow.redirect_uri = redirect_uri
authorization_url, state = flow.authorization_url(access_type="offline", prompt="consent")
# Store state → Redis (key: oauth_state:{state}, value: user_id, TTL 30 min)
# Redirect browser to authorization_url

# At callback (`state` and `code` come from the query string):
# Restore flow from client config (stateless — recreate Flow on each callback)
flow = Flow.from_client_config(client_config, scopes=SCOPES, state=state)
flow.redirect_uri = redirect_uri
flow.fetch_token(code=code)
creds = flow.credentials
# creds.token = access token
# creds.refresh_token = refresh token
# creds.expiry = datetime
```
|
||
|
||
**`access_type="offline"` is required** to obtain a refresh token. Without it, Google only returns a short-lived access token. `prompt="consent"` forces re-consent on each connect, which ensures a fresh refresh token.
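
Once the token fields are in hand they have to be serialized before encryption, and the expiry has to be checkable later. The following is a minimal sketch of one way to do that with the standard library. `serialize_tokens` and `is_expired` are hypothetical helpers, the 60-second skew is an arbitrary choice, and the sketch assumes timezone-aware UTC datetimes and an `expires_in` value in seconds.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def serialize_tokens(access_token: str, refresh_token: str, expires_in: int,
                     now: Optional[datetime] = None) -> str:
    """Turn token fields into a JSON string with an absolute expiry."""
    now = now or datetime.now(timezone.utc)
    expiry = now + timedelta(seconds=expires_in)
    return json.dumps({
        "access_token": access_token,
        "refresh_token": refresh_token,
        "expiry": expiry.isoformat(),
    })


def is_expired(serialized: str, now: Optional[datetime] = None, skew_seconds: int = 60) -> bool:
    """Treat the token as expired slightly early to absorb clock skew."""
    now = now or datetime.now(timezone.utc)
    expiry = datetime.fromisoformat(json.loads(serialized)["expiry"])
    return now >= expiry - timedelta(seconds=skew_seconds)


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
blob = serialize_tokens("access-abc", "refresh-xyz", expires_in=3600, now=t0)
fresh = is_expired(blob, now=t0)
near_end = is_expired(blob, now=t0 + timedelta(seconds=3539))
at_skew = is_expired(blob, now=t0 + timedelta(seconds=3540))
```

The JSON string produced here is the kind of plaintext that would be passed to `encrypt_credentials` before being stored in `credentials_enc`.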
|
||
|
||
### Pattern 4: OneDrive OAuth2 Flow via MSAL
|
||
|
||
```python
# Source: learn.microsoft.com/en-us/entra/msal/python/ [CITED]
import msal

# Resource scopes only. MSAL Python adds the reserved scopes (openid, profile,
# offline_access) itself and raises ValueError if one is passed explicitly.
SCOPES = ["Files.ReadWrite"]
redirect_uri = f"{settings.backend_url}/api/cloud/oauth/callback/onedrive"

# Confidential client app (has client_secret)
app = msal.ConfidentialClientApplication(
    client_id=settings.onedrive_client_id,
    client_credential=settings.onedrive_client_secret,
    authority=f"https://login.microsoftonline.com/{settings.onedrive_tenant_id}",
)

# At initiation:
auth_url = app.get_authorization_request_url(
    scopes=SCOPES,
    redirect_uri=redirect_uri,
    state=state_token,
)
# Redirect browser to auth_url

# At callback:
result = app.acquire_token_by_authorization_code(
    code=code,
    scopes=SCOPES,
    redirect_uri=redirect_uri,
)
# result["access_token"] — short-lived access token
# result["refresh_token"] — long-lived refresh token
# result["expires_in"] — seconds until access_token expires

# Refresh on-demand (D-05):
result = app.acquire_token_by_refresh_token(
    refresh_token=stored_refresh_token,
    scopes=SCOPES,
)
# If result.get("error") == "invalid_grant" → REQUIRES_REAUTH (D-06)
```

**A refresh token is only issued when the `offline_access` scope is granted.** MSAL Python requests `offline_access` (together with `openid` and `profile`) automatically and raises `ValueError` if a reserved scope is passed in `scopes`, so the list should contain only resource scopes such as `Files.ReadWrite`. The `tenant_id` can be `"common"` for multi-tenant apps (personal OneDrive and organizational accounts). For personal OneDrive only, use `"consumers"`.
|
||
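
The refresh-and-retry behaviour required by D-05 and D-06 can be sketched independently of any provider SDK. Everything here is hypothetical: `TokenExpired`, `RefreshRejected`, and `call_with_refresh` are illustrative names, `op` and `refresh` are caller-supplied coroutines, and a plain dict stands in for a `cloud_connections` row. A real implementation would also re-encrypt and persist the new credentials.

```python
import asyncio


class TokenExpired(Exception):
    """Hypothetical: raised when the provider answers HTTP 401."""


class RefreshRejected(Exception):
    """Hypothetical: raised when the provider answers invalid_grant."""


async def call_with_refresh(op, refresh, connection: dict):
    """Run op(credentials); on expiry, refresh once and retry (D-05)."""
    try:
        return await op(connection["credentials"])
    except TokenExpired:
        pass  # fall through to the refresh path
    try:
        connection["credentials"] = await refresh(connection["credentials"])
    except RefreshRejected:
        # D-06: surface the failure instead of retrying silently.
        connection["status"] = "REQUIRES_REAUTH"
        raise
    return await op(connection["credentials"])


async def _demo():
    async def op(creds):
        if creds["access_token"] == "stale":
            raise TokenExpired()
        return f"listed with {creds['access_token']}"

    async def good_refresh(creds):
        return {"access_token": "fresh"}

    async def revoked_refresh(creds):
        raise RefreshRejected()

    ok_conn = {"credentials": {"access_token": "stale"}, "status": "ACTIVE"}
    ok_result = await call_with_refresh(op, good_refresh, ok_conn)

    bad_conn = {"credentials": {"access_token": "stale"}, "status": "ACTIVE"}
    try:
        await call_with_refresh(op, revoked_refresh, bad_conn)
    except RefreshRejected:
        pass
    return ok_result, ok_conn, bad_conn


ok_result, ok_conn, bad_conn = asyncio.run(_demo())
```

Retrying exactly once keeps the request bounded: a second 401 after a successful refresh propagates to the caller rather than looping.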
|
||
### Pattern 5: WebDAV Operations via webdavclient3 + asyncio.to_thread
|
||
|
||
```python
# Source: pypi.org/project/webdavclient3 (VERIFIED: PyPI) [ASSUMED: specific API usage]
import asyncio
import io

from webdav3.client import Client


class WebDAVBackend(StorageBackend):
    def __init__(self, server_url: str, username: str, password: str):
        options = {
            "webdav_hostname": server_url,
            "webdav_login": username,
            "webdav_password": password,
        }
        self._client = Client(options)
        self._base_path = "docuvault/"  # namespace prefix in WebDAV tree

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        # object_key = WebDAV path used as identifier
        object_key = f"{self._base_path}{user_id}/{document_id}{extension}"
        buf = io.BytesIO(file_bytes)
        await asyncio.to_thread(self._client.upload_to, buf, object_key)
        return object_key

    async def get_object(self, object_key: str) -> bytes:
        buf = io.BytesIO()
        await asyncio.to_thread(self._client.download_from, buf, object_key)
        return buf.getvalue()
```
|
||
|
||
Note: `webdavclient3` is synchronous. All calls MUST be wrapped in `asyncio.to_thread()` — same pattern as `MinIOBackend`. [ASSUMED: `upload_to`/`download_from` method names — verify against installed package docs]
|
||
|
||
### Pattern 6: SSRF Prevention via ipaddress Module
|
||
|
||
```python
# Source: python.org/library/ipaddress [VERIFIED: Python stdlib]
import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_NETS = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("169.254.0.0/16"),  # link-local
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("::1/128"),         # IPv6 loopback
    ipaddress.ip_network("fc00::/7"),        # IPv6 ULA
]


def validate_cloud_url(url: str) -> None:
    """Raise ValueError if url targets a private/internal address.

    Called at connect-time and before every WebDAV/Nextcloud request.
    D-17: blocks localhost, 127.x, 169.254.x, RFC 1918 ranges, ::1.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme}")
    hostname = parsed.hostname
    if not hostname:
        raise ValueError("URL has no hostname")

    try:
        # Raw IP literal: check it directly
        addrs = [ipaddress.ip_address(hostname)]
    except ValueError:
        # Not a raw IP: resolve via DNS and check every returned address,
        # because a hostname can carry several A/AAAA records
        try:
            infos = socket.getaddrinfo(hostname, None)
            addrs = [ipaddress.ip_address(info[4][0]) for info in infos]
        except (socket.gaierror, ValueError) as exc:
            raise ValueError(f"Cannot resolve hostname: {exc}") from exc

    for addr in addrs:
        for net in BLOCKED_NETS:
            if addr in net:
                raise ValueError(f"URL targets a private/internal address: {addr}")
```
|
||
|
||
**Security note:** DNS-based SSRF bypass is a known attack vector — an attacker registers a DNS name that resolves to an internal IP. The `validate_cloud_url` function must resolve DNS and check the resolved IP, not just the hostname string. This pattern is the OWASP-recommended approach. [CITED: cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html]
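
One gap worth testing for is the IPv4-mapped IPv6 form (`::ffff:a.b.c.d`): the `ipaddress` module treats it as an IPv6 address, so it does not match any IPv4 network in a blocklist unless it is unwrapped first. The sketch below shows one way to handle that. `is_blocked_address` is a hypothetical helper, and the two extra ranges (`0.0.0.0/8`, `fe80::/10`) are an assumption that goes beyond the list in D-17.

```python
import ipaddress

BLOCKED = [
    ipaddress.ip_network(net)
    for net in (
        "127.0.0.0/8",     # loopback
        "169.254.0.0/16",  # link-local
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
        "0.0.0.0/8",       # "this network" (assumption: not listed in D-17)
        "::1/128",         # IPv6 loopback
        "fc00::/7",        # IPv6 ULA
        "fe80::/10",       # IPv6 link-local (assumption: not listed in D-17)
    )
]


def is_blocked_address(raw: str) -> bool:
    """Return True if the IP literal falls inside a blocked range."""
    addr = ipaddress.ip_address(raw)
    # Unwrap ::ffff:a.b.c.d so it is compared against the IPv4 ranges.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return any(addr in net for net in BLOCKED)


checks = {
    raw: is_blocked_address(raw)
    for raw in ("127.0.0.1", "10.1.2.3", "::ffff:192.168.1.1", "::1", "8.8.8.8", "2001:4860:4860::8888")
}
```

Resolving and checking at validation time does not by itself stop DNS rebinding between the check and the actual request; pinning the connection to the validated IP is a separate measure.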
|
||
|
||
### Pattern 7: OAuth State Storage via Redis
|
||
|
||
```python
# Source: established pattern from backend/api/auth.py (VERIFIED: codebase)
# Redis is already on app.state.redis (aioredis client)
import secrets
import uuid

from fastapi import HTTPException

# At OAuth initiation (`request` and `current_user` come from the endpoint):
state_token = secrets.token_urlsafe(32)
redis_key = f"oauth_state:{state_token}"
await request.app.state.redis.setex(
    redis_key,
    1800,  # 30-minute TTL — long enough for user to complete OAuth consent
    str(current_user.id),
)
# Return redirect to authorization_url with state=state_token

# At OAuth callback (`state` comes from the query string):
redis_key = f"oauth_state:{state}"
user_id_bytes = await request.app.state.redis.get(redis_key)
if not user_id_bytes:
    raise HTTPException(400, "Invalid or expired OAuth state")
await request.app.state.redis.delete(redis_key)  # single-use
user_id = uuid.UUID(user_id_bytes.decode())
```
|
||
|
||
This follows the exact same pattern as TOTP replay prevention in `auth.py` — Redis TTL key, single-use deletion after validation.
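
For unit tests that should not depend on a running Redis, the same issue-once, consume-once semantics can be mimicked in memory. This is a sketch of a test double, not part of the planned implementation: `InMemoryStateStore` is a hypothetical class, and the injectable `clock` exists only so expiry can be exercised without sleeping.

```python
import secrets
import time


class InMemoryStateStore:
    """Test double mirroring the Redis SETEX / GET / DELETE sequence."""

    def __init__(self, ttl_seconds: float = 1800, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def issue(self, user_id: str) -> str:
        """Create a fresh state token bound to user_id."""
        token = secrets.token_urlsafe(32)
        self._entries[token] = (user_id, self._clock() + self._ttl)
        return token

    def consume(self, token: str):
        """Return user_id once, then forget the token; None if unknown or expired."""
        entry = self._entries.pop(token, None)
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if self._clock() < expires_at else None


now = [0.0]
store = InMemoryStateStore(ttl_seconds=10, clock=lambda: now[0])

token = store.issue("user-1")
first_use = store.consume(token)    # valid: returns the user id
second_use = store.consume(token)   # replay: token already removed

late_token = store.issue("user-2")
now[0] = 11.0                       # advance past the TTL
late_use = store.consume(late_token)
```

Because `consume` pops the entry before checking expiry, lookup and deletion happen in one step, so a replayed callback cannot observe the token between a read and a delete.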
|
||
|
||
### Pattern 8: TTLCache for Folder Listings (cachetools)
|
||
|
||
```python
# Source: cachetools.readthedocs.io [CITED]
import threading

from cachetools import TTLCache

# In FastAPI lifespan or module-level singleton
# maxsize=1000: enough for ~50 users × 20 folder nodes each
# ttl=60: 60-second cache per D-16
_folder_cache: TTLCache = TTLCache(maxsize=1000, ttl=60)
_folder_cache_lock = threading.Lock()


async def get_cloud_folders_cached(user_id: str, provider: str, folder_id: str, fetch_fn) -> list:
    """Return cached result or call fetch_fn and cache it."""
    cache_key = f"{user_id}:{provider}:{folder_id}"
    with _folder_cache_lock:
        if cache_key in _folder_cache:
            return _folder_cache[cache_key]

    result = await fetch_fn()  # async — outside the lock

    with _folder_cache_lock:
        _folder_cache[cache_key] = result
    return result
```
|
||
|
||
**Thread safety:** `cachetools.TTLCache` is not thread-safe by itself. A `threading.Lock` is required for concurrent access. The fetch function itself is async and must be called outside the lock to avoid blocking the event loop. [CITED: cachetools.readthedocs.io — "access to a shared cache from multiple threads must be properly synchronized"]
|
||
|
||
### Pattern 9: Factory Extension (get_storage_backend)

```python
# Source: backend/storage/__init__.py (VERIFIED: codebase)
# Current factory only returns MinIOBackend. Phase 5 extends it:

async def get_storage_backend_for_document(
    document: Document,
    user: User,
    session: AsyncSession,
) -> StorageBackend:
    """Return the correct StorageBackend for the given document.

    MinIO documents (storage_backend='minio'): return shared MinIOBackend.
    Cloud documents: load CloudConnection, decrypt credentials, return backend instance.
    """
    if document.storage_backend == "minio":
        return get_storage_backend()  # existing factory

    # Load cloud connection
    result = await session.execute(
        select(CloudConnection).where(
            CloudConnection.user_id == user.id,
            CloudConnection.provider == document.storage_backend,
            CloudConnection.status == "ACTIVE",
        )
    )
    conn = result.scalar_one_or_none()
    if conn is None:
        raise HTTPException(503, "Cloud connection not found or inactive")

    master_key = settings.cloud_creds_key.encode()
    credentials = decrypt_credentials(master_key, str(user.id), conn.credentials_enc)

    if document.storage_backend == "google_drive":
        return GoogleDriveBackend(credentials)
    elif document.storage_backend == "onedrive":
        return OneDriveBackend(credentials)
    elif document.storage_backend in ("nextcloud", "webdav"):
        return WebDAVBackend(credentials["server_url"], credentials["username"], credentials["password"])
    else:
        raise ValueError(f"Unknown storage backend: {document.storage_backend}")
```

### Pattern 10: On-Demand Token Refresh (D-05)

```python
# Source: D-05 decision (CONTEXT.md) [ASSUMED: exact error class names]
class GoogleDriveBackend(StorageBackend):
    async def _call_with_refresh(self, operation_fn, credentials: dict, user_id: str, conn: CloudConnection, session):
        """Attempt operation; on 401, refresh tokens and retry once."""
        try:
            return await operation_fn(credentials)
        except Exception as e:
            # Google Drive: googleapiclient.errors.HttpError with status 401
            if _is_token_expired_error(e):
                new_creds = await self._refresh_token(credentials)
                if new_creds is None:
                    # invalid_grant: set REQUIRES_REAUTH (D-06)
                    conn.status = "REQUIRES_REAUTH"
                    await session.commit()
                    raise CloudConnectionError("Cloud connection requires re-authentication")
                # Update credentials_enc
                master_key = settings.cloud_creds_key.encode()
                conn.credentials_enc = encrypt_credentials(master_key, user_id, new_creds)
                conn.status = "ACTIVE"
                await session.commit()
                return await operation_fn(new_creds)
            raise
```

### Anti-Patterns to Avoid

- **Storing OAuth state in FastAPI process memory:** Multi-instance deployments will fail because the callback may arrive at a different instance than the one that created the state. Use Redis.
- **Reusing the HKDF instance:** The `cryptography` library raises `AlreadyFinalized` on second call to `.derive()`. Always create a new `HKDF` instance per key derivation.
- **Checking hostname string for SSRF, not resolved IP:** `validate_cloud_url("http://internal.corp")` would pass a string check but may resolve to `10.0.0.1`. Always resolve DNS and check the resulting IP.
- **Returning `credentials_enc` in any API response:** The `CloudConnectionOut` Pydantic model (already in `admin.py`) is the whitelist; use it for all cloud connection responses.
- **Calling cloud SDK methods from the async event loop without `asyncio.to_thread()`:** All cloud SDKs (`google-api-python-client`, `msal`, `webdavclient3`) are synchronous. Blocking the event loop kills throughput.
- **Using `prompt="consent"` only on first connect:** Without `prompt="consent"`, Google may not return a refresh token on reconnect if the app was previously authorized. Always pass `prompt="consent"` to guarantee a fresh refresh token.
- **Single cloud_connections row per user:** The schema supports multiple providers simultaneously (one row per provider per user, D-13). The upsert logic must match on `(user_id, provider)`, not just `user_id`.

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| OAuth2 PKCE + token exchange for Google | Custom HMAC/base64 code verifier | `google_auth_oauthlib.flow.Flow` | Handles RFC 7636 PKCE, redirect URI validation, and token serialization |
| OAuth2 for Microsoft Graph | Raw `requests` calls to login.microsoftonline.com | `msal.ConfidentialClientApplication` | MSAL handles token cache, `invalid_grant` detection, tenant routing, and PKCE |
| WebDAV PROPFIND XML | Raw `httpx` with hand-coded XML bodies | `webdavclient3.Client` | PROPFIND response parsing, multistatus handling, redirect following |
| Fernet encryption + key derivation | AES-GCM + custom key stretching | `cryptography` Fernet + HKDF | Fernet is misuse-resistant (authenticated encryption with IV, HMAC tag); hand-rolled AES can fail silently |
| Private IP detection for SSRF | Regex on URL string | `ipaddress.ip_address()` checked against `ipaddress.ip_network()` ranges | Python's `ipaddress` module parses IPv4 and IPv6 literals correctly; IPv4-mapped addresses such as `::ffff:127.0.0.1` must be unwrapped via `.ipv4_mapped` before comparing against IPv4 ranges |
| In-memory TTL cache | `dict` with `asyncio.get_event_loop().time()` comparison | `cachetools.TTLCache` | TTLCache provides LRU eviction and correct TTL semantics; concurrent access still needs an external lock (Pattern 8) |
| OAuth state token validation | JWT with custom HMAC | Redis TTL key | Redis TTL provides natural expiry + single-use deletion; no new secret required |

**Key insight:** All cloud credential handling is a solved problem at the library level. The most common Phase 5 failure mode would be attempting to re-implement OAuth token exchange logic, where edge cases around redirect URI matching, PKCE, and token format break silently.

---

## Common Pitfalls

### Pitfall 1: Google Refresh Token Only Issued Once
**What goes wrong:** User connects Google Drive; the first connection includes a refresh token. Later the user disconnects and reconnects. Google does not issue a new refresh token because the user already authorized the app, so the re-authorization returns only an access token. Credentials are stored but the connection goes stale in 1 hour.
**Why it happens:** Google only issues a refresh token on the first authorization for a given client_id + user pair, or when `prompt="consent"` is explicitly passed.
**How to avoid:** Always pass `prompt="consent"` and `access_type="offline"` in `flow.authorization_url()`.
**Warning signs:** `credentials.refresh_token` is `None` after `flow.fetch_token()`.

### Pitfall 2: webdavclient3 Path Encoding for Nextcloud
**What goes wrong:** Nextcloud returns 404 or 207 Multi-Status with an empty propfind result for paths with spaces or non-ASCII characters when the path is not percent-encoded.
**Why it happens:** Nextcloud's WebDAV endpoint requires percent-encoded paths; webdavclient3 may or may not encode paths depending on the method called.
**How to avoid:** Use `urllib.parse.quote()` on all path segments before passing to webdavclient3 operations that accept raw paths. [ASSUMED: verify against webdavclient3 docs during implementation]
**Warning signs:** Works with ASCII-only filenames; fails with spaces or umlauts.

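As an illustration of the per-segment encoding suggested in Pitfall 2, the sketch below percent-encodes each path segment with the standard library while leaving the `/` separators intact. The helper name `encode_dav_path` is hypothetical. Whether a given webdavclient3 method already encodes its argument is still an open assumption, so applying this blindly could double-encode.

```python
from urllib.parse import quote


def encode_dav_path(path: str) -> str:
    """Percent-encode every segment of a slash-separated path."""
    # safe="" forces encoding of characters quote() would otherwise keep,
    # so a literal "/" can only appear as a segment separator.
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```

For example, `encode_dav_path("Dokumente/Prüfung 2024.pdf")` yields `Dokumente/Pr%C3%BCfung%202024.pdf`.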
### Pitfall 3: HKDF AlreadyFinalized Error
**What goes wrong:** `cryptography.exceptions.AlreadyFinalized` is raised when `HKDF.derive()` is called a second time on the same instance.
**Why it happens:** HKDF is a one-shot operation by design in the `cryptography` library.
**How to avoid:** Create a new `HKDF(...)` instance inside `_derive_fernet_key()` on every call; never store or reuse the HKDF instance.
**Warning signs:** Works in unit tests (each test creates a fresh instance), fails under concurrent load or in repeated calls within the same request.

### Pitfall 4: OAuth Callback State Mismatch in Multi-Instance Deployment
**What goes wrong:** State token is stored in a Python dict in-process. The OAuth callback arrives at a different uvicorn instance → `invalid state` error.
**Why it happens:** HTTP requests are not session-sticky in a load-balanced deployment.
**How to avoid:** Store OAuth state in Redis (`app.state.redis`) with a 30-minute TTL. [VERIFIED: Redis already wired in codebase at `app.state.redis`]
**Warning signs:** OAuth works in single-instance Docker Compose but fails intermittently in production.

### Pitfall 5: DNS Rebinding Attack on SSRF Validation
**What goes wrong:** `validate_cloud_url` resolves `attacker.com` to `8.8.8.8` (passes validation), then the subsequent request resolves `attacker.com` to `169.254.169.254` (cloud metadata endpoint). The validation and the actual request see different IPs.
**Why it happens:** DNS TTL expires between validation and request; attacker controls the DNS.
**How to avoid:** Use `socket.create_connection` with the pre-validated IP directly (pin the IP), or document that a network-level egress firewall is the defense-in-depth layer for DNS rebinding. Calling `validate_cloud_url` immediately before each request (not once at connect time) reduces the window. [CITED: cheatsheetseries.owasp.org]
**Warning signs:** SSRF test passes with direct IP inputs but might miss DNS-based attacks.

### Pitfall 6: Microsoft Graph Upload Size Limit
**What goes wrong:** Files larger than 4 MB fail with `413 Request Entity Too Large` when uploaded via a single PUT/POST to Microsoft Graph.
**Why it happens:** Microsoft Graph's simple upload endpoint is limited to 4 MB. Larger files require a resumable upload session (`createUploadSession`).
**How to avoid:** For Phase 5, use resumable upload sessions; per the resolution of Open Question 3 they are used for all file sizes, so no 4 MB size gate is needed. Use `POST /me/drive/root:/{path}:/createUploadSession` to get an upload URL, then upload in 10 MB chunks.
**Warning signs:** Tests with small files pass; production uploads of real documents (> 4 MB) fail silently or with 413.

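To make the chunking in Pitfall 6 concrete, here is a small sketch that computes the `Content-Range` header value for each fragment of a resumable upload. The function name `content_ranges` is hypothetical. The 320 KiB granularity check reflects Microsoft's documented requirement that fragment sizes be a multiple of 320 KiB; 10 MiB satisfies it (32 × 320 KiB). The actual HTTP PUT calls are left out.

```python
GRAPH_FRAGMENT_UNIT = 320 * 1024   # fragments must be a multiple of 320 KiB
CHUNK_SIZE = 10 * 1024 * 1024      # 10 MiB, as chosen in this document


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return one Content-Range header value per upload fragment."""
    if chunk_size <= 0 or chunk_size % GRAPH_FRAGMENT_UNIT != 0:
        raise ValueError("chunk_size must be a positive multiple of 320 KiB")
    ranges = []
    start = 0
    while start < total_size:
        # The end offset is inclusive; the last fragment may be shorter.
        end = min(start + chunk_size, total_size) - 1
        ranges.append(f"bytes {start}-{end}/{total_size}")
        start = end + 1
    return ranges
```

Each returned string would accompany one PUT of the matching byte slice to the upload URL returned by `createUploadSession`.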
### Pitfall 7: Google Drive files() Service is Synchronous
**What goes wrong:** `googleapiclient.discovery.build()` and all `service.files().xxx().execute()` calls are synchronous and block the event loop.
**Why it happens:** `google-api-python-client` was built before asyncio was standard.
**How to avoid:** Wrap every SDK call in `asyncio.to_thread()`. Do NOT await `service.files().list()` directly; it is not a coroutine.
**Warning signs:** FastAPI request handler completes quickly in tests but blocks under load.

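The wrapping described in Pitfall 7 can be sketched with a stand-in for the blocking SDK call. `blocking_list_files` is a hypothetical placeholder for something like a `service.files().list(...).execute()` call; only `asyncio.to_thread` is the real mechanism being shown.

```python
import asyncio
import time


def blocking_list_files(delay: float) -> list[str]:
    """Stand-in for a synchronous SDK call that blocks its thread."""
    time.sleep(delay)
    return ["a.pdf", "b.pdf"]


async def list_files_async(delay: float = 0.05) -> list[str]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_list_files, delay)


async def list_twice_concurrently() -> list[list[str]]:
    # Both calls overlap instead of running back to back.
    return await asyncio.gather(list_files_async(), list_files_async())
```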

---

## Code Examples

### Credential Round-Trip Test (CLOUD-02)

```python
# Source: based on cryptography.io HKDF docs [CITED: cryptography.io]
import base64
import json
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.fernet import Fernet

def test_credential_encryption_round_trip():
    master_key = b"test-master-key-32bytes-padded!!"  # 32 bytes
    user_id = "550e8400-e29b-41d4-a716-446655440000"
    credentials = {"access_token": "ya29.xxx", "refresh_token": "1//xxx", "expiry": "2026-05-28T15:00:00"}

    encrypted = encrypt_credentials(master_key, user_id, credentials)
    assert isinstance(encrypted, str)
    assert "access_token" not in encrypted  # not plaintext

    # Decrypt the ciphertext (not the original dict) to prove the round trip
    decrypted = decrypt_credentials(master_key, user_id, encrypted)
    assert decrypted == credentials
```

### SSRF Validation Test

```python
# Source: pattern derived from OWASP SSRF cheat sheet [CITED: cheatsheetseries.owasp.org]
import pytest

@pytest.mark.parametrize("url,should_raise", [
    ("http://localhost/dav", True),
    ("http://127.0.0.1/dav", True),
    ("http://169.254.169.254/dav", True),
    ("http://10.0.0.1/dav", True),
    ("http://192.168.1.1/dav", True),
    ("http://172.16.0.1/dav", True),
    ("https://nextcloud.example.com/remote.php/dav", False),
    ("http://[::1]/dav", True),  # IPv6 literals need brackets in URLs
])
def test_ssrf_validation(url, should_raise):
    if should_raise:
        with pytest.raises(ValueError):
            validate_cloud_url(url)
    else:
        validate_cloud_url(url)  # no exception
```

### CloudConnectionOut Whitelist Enforcement

```python
# Source: backend/api/admin.py (VERIFIED: codebase)
# The CloudConnectionOut model already exists in admin.py.
# ALL cloud connection endpoints must use this model, not CloudConnection ORM directly.
class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    model_config = {"from_attributes": True}

# Usage in cloud.py:
@router.get("/api/cloud/connections")
async def list_connections(
    current_user: User = Depends(get_regular_user),
    session: AsyncSession = Depends(get_db),
) -> dict:
    result = await session.execute(
        select(CloudConnection).where(CloudConnection.user_id == current_user.id)
    )
    connections = result.scalars().all()
    return {"items": [CloudConnectionOut.model_validate(c).model_dump() for c in connections]}
```

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Storing OAuth state in Flask/FastAPI session (in-memory) | Redis TTL-keyed state tokens | ~2022 with multi-instance deployments becoming standard | Multi-instance safety; prevents token fixation |
| webdav-client-python (original) | webdavclient3 (fork, actively maintained) | 2018 | webdav-client-python is unmaintained; webdavclient3 is the maintained fork |
| `google.oauth2.credentials.Credentials` with service accounts | `google-auth-oauthlib` Flow for user-delegated access | 2019 | Service accounts require GSuite domain; user OAuth is required for personal Drive |
| ADAL (Azure Active Directory Authentication Library) for Python | MSAL (Microsoft Authentication Library) | 2020; ADAL deprecated | ADAL end-of-life June 2023; MSAL is the replacement |
| Using `Fernet.generate_key()` with user passwords | HKDF + Fernet (key derivation before Fernet) | Ongoing best practice | Fernet keys must be 32 bytes (URL-safe base64-encoded); `generate_key()` generates fresh random keys, not deterministic per-user keys |

**Deprecated/outdated:**
- `adal` Python package: End-of-life; replaced by `msal`. Do NOT use.
- `webdav-client-python` (without the `3`): Unmaintained since ~2018. Use `webdavclient3`.
- `google.oauth2.service_account.Credentials`: For service accounts, not user-delegated Drive access. Wrong tool for this use case.

---

## Assumptions Log

| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | `webdavclient3` uses `upload_to` / `download_from` method names for stream-based operations | Architecture Patterns Pattern 5 | Planner must verify method signatures against installed package; wrong method names cause `AttributeError` at test time |
| A2 | Google Drive `googleapiclient.errors.HttpError` status 401 is the token-expiry signal | Pattern 10: On-Demand Token Refresh | Actual exception class may differ; must verify during implementation with a real expired token |
| A3 | Microsoft Graph `invalid_grant` error appears in `result["error"]` from `msal.acquire_token_by_refresh_token` | Pattern 10 | MSAL may use a different error field or raise an exception; verify against msal docs |
| A4 | `webdavclient3` percent-encodes paths automatically | Pitfall 2 | May require manual encoding; verify during WebDAV backend implementation |
| A5 | `tenant_id="common"` works for both personal OneDrive and organizational accounts | Pattern 4: MSAL | May require `"consumers"` for personal accounts; verify against Microsoft docs for the target use case |

---

## Open Questions (RESOLVED)

1. **Google Drive object key scheme for `stat_object`**
   - What we know: MinIO `stat_object` returns size in bytes from the storage layer. Google Drive returns file metadata including `size` from `files.get(fileId, fields='size')`.
   - What's unclear: Google Drive may not return `size` for Google Workspace files (Docs, Sheets, Slides) since they have no binary size. DocuVault uploads binary files, so this may not be an issue in practice.
   - Recommendation: Implement `stat_object` using `service.files().get(fileId=object_key, fields="size").execute()` and return `int(metadata["size"])`. Add a fallback of `0` for files without a size.
   - **RESOLVED:** Use `service.files().get(fileId=object_key, fields="size").execute()` and return `int(metadata.get("size", 0))`. DocuVault only uploads binary files so the `0` fallback handles edge cases without breaking functionality.

2. **Nextcloud folder listing path convention**
   - What we know: Nextcloud WebDAV base path is typically `/remote.php/dav/files/{username}/`.
   - What's unclear: Whether the `webdavclient3` `Client` automatically handles the `/remote.php/dav/files/{username}/` prefix or whether it must be included in the `server_url`.
   - Recommendation: Store `server_url` as the full WebDAV root (e.g., `https://nc.example.com/remote.php/dav/files/alice/`) and use relative paths within it. Test with PROPFIND on the root to validate the connection (D-08).
   - **RESOLVED:** `server_url` stores the full WebDAV root including the `/remote.php/dav/files/{username}/` prefix. All relative paths within WebDAVBackend and NextcloudBackend are appended to this base. Connection validation uses a PROPFIND on the root path per D-08.

3. **Microsoft Graph upload for files > 4 MB**
   - What we know: Simple upload (PUT `/me/drive/root:/{path}:/content`) is limited to 4 MB. Resumable sessions handle larger files.
   - What's unclear: The Phase 5 plan should specify whether to implement resumable sessions upfront or use a 4 MB size gate.
   - Recommendation: Implement resumable upload session (`createUploadSession`) for all files to avoid the hard limit. It handles both small and large files without a size check.
   - **RESOLVED:** Implement `createUploadSession` for ALL file sizes (no size gate). `CHUNK_SIZE = 10 * 1024 * 1024` (10 MB, above Graph 4 MB limit) used in all OneDrive uploads. Pitfall 6 documented in Common Pitfalls section.

---

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python 3.12 (Docker) | All backends | In Docker container | 3.12.x | - |
| Redis | OAuth state storage | In Docker Compose | 6.x+ | - |
| PostgreSQL | cloud_connections table | In Docker Compose | 15.x | - |
| `cryptography` package | Credential encryption | NOT in requirements.txt | - | Must be added (48.0.0 verified) |
| `google-auth-oauthlib` | Google Drive OAuth | NOT in requirements.txt | - | Must be added (1.3.1 verified) |
| `google-api-python-client` | Google Drive API | NOT in requirements.txt | - | Must be added (2.196.0 verified) |
| `msal` | OneDrive OAuth | NOT in requirements.txt | - | Must be added (1.36.0 verified) |
| `webdavclient3` | WebDAV/Nextcloud | NOT in requirements.txt | - | Must be added (3.14.7 verified) |
| `cachetools` | Folder listing cache | NOT in requirements.txt | - | Must be added (6.2.6 verified) |
| Google OAuth App (GCP console) | Google Drive integration | NOT CONFIGURED | - | Must be created by user; client_id/client_secret added to .env |
| Microsoft App Registration (Azure portal) | OneDrive integration | NOT CONFIGURED | - | Must be created by user; client_id/client_secret/tenant_id added to .env |

**Missing dependencies with no fallback:**
- `cryptography`, `google-auth-oauthlib`, `google-api-python-client`, `msal`, `webdavclient3`, `cachetools`: must be added to `requirements.txt` before any cloud backend code runs.

**Missing dependencies with fallback (soft):**
- Google OAuth App credentials: Integration tests for Google Drive will need mocked OAuth flows if a real GCP app is not configured. Unit tests can mock the entire OAuth flow.
- Microsoft App Registration: Same as above for OneDrive.

---

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already in requirements.txt) |
| Config file | `backend/pytest.ini` (already exists) |
| Quick run command | `cd backend && pytest tests/test_cloud.py -x -v` |
| Full suite command | `cd backend && pytest -v` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| CLOUD-01 | User can connect all 4 providers | Integration | `pytest tests/test_cloud.py::test_connect_google_drive -x` | ❌ Wave 0 |
| CLOUD-01 | OAuth callback validates state and saves connection | Integration | `pytest tests/test_cloud.py::test_oauth_callback_valid_state -x` | ❌ Wave 0 |
| CLOUD-01 | Invalid OAuth state returns 400 | Integration | `pytest tests/test_cloud.py::test_oauth_callback_invalid_state -x` | ❌ Wave 0 |
| CLOUD-01 | WebDAV/Nextcloud connection validated before save (D-08) | Integration | `pytest tests/test_cloud.py::test_webdav_connect_validates -x` | ❌ Wave 0 |
| CLOUD-02 | Credential encryption/decryption round-trip | Unit | `pytest tests/test_cloud.py::test_credential_round_trip -x` | ❌ Wave 0 |
| CLOUD-02 | `credentials_enc` not in any API response (SEC-08) | Integration | `pytest tests/test_cloud.py::test_credentials_enc_not_exposed -x` | ❌ Wave 0 |
| CLOUD-03 | Upload to cloud folder goes through FastAPI (not presigned URL) | Integration | `pytest tests/test_cloud.py::test_cloud_upload_no_presigned -x` | ❌ Wave 0 |
| CLOUD-04 | Connection status displayed correctly | Integration | `pytest tests/test_cloud.py::test_connection_status_display -x` | ❌ Wave 0 |
| CLOUD-05 | `invalid_grant` → `REQUIRES_REAUTH` transition | Integration | `pytest tests/test_cloud.py::test_invalid_grant_sets_requires_reauth -x` | ❌ Wave 0 |
| CLOUD-06 | Disconnect permanently deletes credentials | Integration | `pytest tests/test_cloud.py::test_disconnect_deletes_credentials -x` | ❌ Wave 0 |
| CLOUD-07 | StorageBackend factory returns correct type | Unit | `pytest tests/test_cloud.py::test_factory_returns_correct_backend -x` | ❌ Wave 0 |
| D-17 | SSRF validation blocks RFC-1918 and loopback | Unit | `pytest tests/test_cloud.py::test_ssrf_validation -x` | ❌ Wave 0 |
| D-17 | SSRF validation blocks 169.254.x link-local | Unit | `pytest tests/test_cloud.py::test_ssrf_link_local -x` | ❌ Wave 0 |
| SEC | Admin cannot access cloud connection credentials | Integration | `pytest tests/test_cloud.py::test_admin_cannot_see_credentials -x` | ❌ Wave 0 |
| SEC | Cross-user cloud connection access returns 404 | Integration | `pytest tests/test_cloud.py::test_cross_user_idor -x` | ❌ Wave 0 |

### Sampling Rate

- **Per task commit:** `cd backend && pytest tests/test_cloud.py -x -v`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `backend/tests/test_cloud.py`: all Phase 5 tests (unit + integration), starting with xfail stubs
- [ ] New conftest fixtures: `mock_google_drive_creds`, `mock_onedrive_creds`, `mock_webdav_client`, `cloud_connection_factory`

---

## Security Domain

### Applicable ASVS Categories

| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | yes | OAuth2 state CSRF; per-session token; `get_regular_user` dep on all cloud endpoints |
| V3 Session Management | yes | OAuth state token is single-use; stored in Redis with TTL; deleted after callback |
| V4 Access Control | yes | Every `/api/cloud/*` endpoint asserts `connection.user_id == current_user.id` before operations |
| V5 Input Validation | yes | `validate_cloud_url()` for WebDAV/Nextcloud; Pydantic models for all request bodies; no raw string interpolation in URLs |
| V6 Cryptography | yes | HKDF + Fernet for credential encryption; Fernet's AES-128-CBC with HMAC-SHA256 via the `cryptography` library (never hand-rolled) |
| V7 Error Handling | yes | `invalid_grant` handled explicitly (D-06); no stack traces in cloud API error responses |

### Known Threat Patterns for OAuth + Cloud Storage

| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| CSRF on OAuth callback | Tampering | `state` parameter validated via Redis; state token is `secrets.token_urlsafe(32)` |
| SSRF via WebDAV/Nextcloud URL | Tampering / Information Disclosure | `validate_cloud_url()` at connect-time and before each request; DNS resolution followed by an `ipaddress` range check |
| Credential exposure via API leak | Information Disclosure | `CloudConnectionOut` Pydantic whitelist; `credentials_enc` excluded by omission |
| Token replay via OAuth state | Elevation of Privilege | Redis single-use deletion after callback; 30-minute TTL prevents stale states |
| Cross-user cloud connection access | IDOR | `connection.user_id == current_user.id` assertion on every operation; 404 not 403 |
| Unverified credentials stored (D-08) | Information Disclosure / DoS | PROPFIND/OPTIONS validation before storage; error returned on failure |
| Refresh token theft from DB | Information Disclosure | `credentials_enc` is Fernet-encrypted with HKDF per-user key; master key in env var only |
| Admin accessing user cloud credentials | Broken Access Control | `get_regular_user` dep blocks admin (403); `CloudConnectionOut` whitelist on all responses |
| DNS rebinding SSRF bypass | Tampering | `validate_cloud_url()` called immediately before each outbound request (not only at connect-time); documented defense-in-depth via network egress firewall |

---

## Project Constraints (from CLAUDE.md)

The following CLAUDE.md directives are binding for Phase 5:

- JWT access token lives in Pinia memory only, never localStorage or sessionStorage (OAuth callback must redirect to Vue with a query param, not embed tokens in the URL)
- Cloud credentials encrypted with HKDF per-user key derivation; master key in env var only
- Admin endpoints never return `credentials_enc`
- Every cloud connection endpoint asserts `resource.user_id == current_user.id`
- All DB queries via ORM / parameterized statements; zero raw string interpolation
- `get_regular_user` on all cloud connection endpoints (admin blocked from this surface)
- `write_audit_log()` called on cloud connect, disconnect, and re-auth events
- Testing protocol: every new function, endpoint, and component must have at least one test; `pytest -v` must pass with zero failures
- Security gate: `bandit -r backend/`, `pip audit`, `npm audit --audit-level=high` must all pass before phase advancement
- Bug fix rule: root cause only, ≤50 lines, regression test required

---

## Sources

### Primary (HIGH confidence)

- `backend/storage/base.py`: StorageBackend ABC, 7 abstract methods, exact signatures
- `backend/storage/minio_backend.py`: asyncio.to_thread() wrapping pattern, error handling shape
- `backend/storage/__init__.py`: factory pattern to extend
- `backend/db/models.py`: CloudConnection model fields, Document.storage_backend, User.default_storage_backend
- `backend/api/admin.py`: CloudConnectionOut Pydantic whitelist pattern (already exists)
- `backend/main.py`: Redis wiring on app.state.redis, lifespan pattern
- `backend/deps/auth.py`: get_regular_user, get_current_user patterns
- `backend/migrations/versions/0001_initial_schema.py`: confirmed cloud_connections table, storage_backend columns
- [cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/](https://cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/): HKDF usage and info parameter
- [cryptography.io/en/latest/fernet/](https://cryptography.io/en/latest/fernet/): Fernet key format
- [googleapis.dev/python/google-auth-oauthlib/latest](https://googleapis.dev/python/google-auth-oauthlib/latest/reference/google_auth_oauthlib.flow.html): Flow class API
- PyPI `pip download`, confirmed versions: cryptography-48.0.0, google_auth_oauthlib-1.3.1, google_api_python_client-2.196.0, msal-1.36.0, webdavclient3-3.14.7, cachetools-6.2.6
- slopcheck 0.6.1: all 7 packages rated [OK]

### Secondary (MEDIUM confidence)

- [learn.microsoft.com/en-us/entra/msal/python/](https://learn.microsoft.com/en-us/entra/msal/python/): MSAL Python overview and authorization code flow
- [cachetools.readthedocs.io](https://cachetools.readthedocs.io/en/stable/): TTLCache thread safety requirement
- [cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html](https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html): DNS resolution-based SSRF check

### Tertiary (LOW confidence / ASSUMED)

- webdavclient3 specific method names (`upload_to`, `download_from`): marked [ASSUMED] above; verify during implementation
- Exact Microsoft Graph error field for `invalid_grant` in MSAL: marked [ASSUMED] above

---

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH; all packages verified on PyPI, slopcheck clean, versions confirmed
- Architecture: HIGH; built directly from codebase inspection; ABC, factory, CloudConnection model, Redis wiring all verified
- OAuth2 flows: MEDIUM/HIGH; google-auth-oauthlib Flow API verified via official docs; MSAL pattern confirmed via Microsoft docs
- Pitfalls: HIGH; based on official library docs and known OAuth edge cases
- SSRF prevention: HIGH; Python stdlib ipaddress module; OWASP-cited approach

**Research date:** 2026-05-28
**Valid until:** 2026-06-28 (30 days); package versions are stable but verify before pinning in requirements.txt