# Phase 5: Cloud Storage Backends — Research
**Researched:** 2026-05-28
**Domain:** OAuth2 cloud provider integration, WebDAV/Nextcloud, credential encryption, SSRF prevention, StorageBackend ABC extension
**Confidence:** HIGH (all package versions verified on PyPI; patterns verified against official docs and codebase)
---
## User Constraints (from CONTEXT.md)
### Locked Decisions
- **D-01:** All 4 providers (OneDrive/Microsoft Graph, Google Drive v3, Nextcloud, WebDAV) delivered in this single phase.
- **D-02:** Each provider is a concrete `StorageBackend` subclass in `backend/storage/` (e.g., `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`).
- **D-03:** FastAPI owns the OAuth callback. Flow: user clicks "Connect" → provider OAuth consent page → `GET /api/cloud/oauth/callback/{provider}?code=…&state=…` → FastAPI exchanges code, encrypts credentials, saves to `cloud_connections`, then redirects browser to Vue settings page with `?cloud_connected=google_drive` (or `?cloud_error=…`). Auth code and tokens never land in the frontend.
- **D-04:** OAuth state parameter encodes the authenticated user's ID (signed or encrypted) using `secrets.token_urlsafe(32)` + a short-lived server-side state store (Redis or DB) to validate the callback matches the initiating user session.
- **D-05:** Access token refresh is on-demand and transparent. When a cloud API call fails with token-expiry (HTTP 401), the backend catches it, uses the stored refresh token, updates `credentials_enc` in DB, and retries the original call within the same request.
- **D-06:** If the refresh token is rejected by the provider (`invalid_grant`), the connection status transitions to `REQUIRES_REAUTH` and the request returns an error telling the user to reconnect. No silent failure.
- **D-07:** UI presents both auth methods for Nextcloud/WebDAV (real account password and app-specific password) with clear recommendation for app password.
- **D-08:** On save, backend validates the WebDAV/Nextcloud connection (lightweight PROPFIND or OPTIONS request) before storing credentials. If validation fails, return an error — never store unverified credentials.
- **D-09:** Sidebar shows local MinIO folders first, then each connected cloud provider as a peer top-level node. Lazy-load one level at a time.
- **D-10:** Upload destination follows the active folder context. Cloud uploads go through FastAPI intermediary — no direct browser-to-cloud.
- **D-11:** Existing MinIO documents stay in MinIO — no migration. `storage_backend="minio"` for existing docs; `"google_drive"`, `"onedrive"`, etc. for new cloud docs.
- **D-12:** Cloud provider management lives in a new "Cloud Storage" tab in SettingsView.
- **D-13:** Multiple cloud providers can be connected simultaneously (one row per provider in `cloud_connections`).
- **D-14:** Cloud backends: `generate_presigned_put_url` raises `NotImplementedError`. Upload endpoint detects cloud backends and uses direct upload path.
- **D-15:** Downloads/previews use the same `GET /api/documents/{id}/content` proxy endpoint regardless of backend. Calls `storage_backend.get_object(document.object_key)` and streams bytes to browser.
- **D-16:** Cloud folder tree browsing is live API calls with a 60-second in-memory TTL cache (keyed by `user_id + provider + folder_path`). Not Redis — in-memory is sufficient.
- **D-17:** All outbound HTTP to WebDAV/Nextcloud validates URL against SSRF blocklist (localhost, 127.x, 169.254.x, RFC 1918, ::1). Validation in a shared `validate_cloud_url()` utility called before every request.
- **D-18:** `credentials_enc` encrypted with `HKDF(CLOUD_CREDS_KEY, salt=user_id_bytes, info=b"cloud-credentials")`. Master key in `CLOUD_CREDS_KEY` env var. Never stored unencrypted. Never returned in any API response.
- **D-19:** Admin API responses for cloud connections return only `provider, display_name, connected_at, status` (CloudConnectionOut Pydantic whitelist pattern from Phase 4).
### Claude's Discretion
- Choice of Python OAuth client library for Google Drive and OneDrive (e.g., `google-auth-oauthlib`, `msal`).
- Choice of WebDAV Python library (e.g., `webdavclient3`, `aiohttp` with manual PROPFIND).
- Exact TTL cache implementation (dict + timestamp vs. `cachetools.TTLCache`).
- OAuth state store implementation (Redis vs. short-lived DB row vs. signed JWT).
### Deferred Ideas (OUT OF SCOPE)
- Document migration between backends (user-initiated move of MinIO docs to cloud).
- Cloud-native resumable upload URLs (provider-specific presigned upload sessions).
- Shared cloud storage (team/organization).
- Cloud folder sync / offline cache.
- Email notifications on REQUIRES_REAUTH.
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| CLOUD-01 | User can connect OneDrive (Microsoft Graph), Google Drive (v3 API), Nextcloud, or generic WebDAV as a personal storage backend | MSAL + google-auth-oauthlib OAuth2 flows; webdavclient3 for WebDAV/Nextcloud |
| CLOUD-02 | Cloud OAuth credentials encrypted using HKDF per-user key derivation (`HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`); master key in `CLOUD_CREDS_KEY` env var | `cryptography` library HKDF + Fernet pattern documented |
| CLOUD-03 | Local MinIO storage and connected cloud backends coexist; user can select their default storage destination | `documents.storage_backend` column already in schema; `users.default_storage_backend` column already present |
| CLOUD-04 | Each cloud connection displays status: `ACTIVE | REQUIRES_REAUTH | ERROR` | `CloudConnection.status` column already in schema |
| CLOUD-05 | On OAuth revocation (`invalid_grant`), connection status transitions to `REQUIRES_REAUTH` — surfaced to user, not retried silently | On-demand token refresh pattern with `invalid_grant` catch documented |
| CLOUD-06 | User can disconnect a cloud backend; credentials are permanently deleted from the DB | `DELETE /api/cloud/connections/{id}` with ownership check |
| CLOUD-07 | Storage backend abstracted via `StorageBackend` ABC + factory in `storage/` module (mirrors existing `ai/` provider pattern) | ABC already exists with 7 abstract methods; factory already in `storage/__init__.py` |
---
## Summary
Phase 5 extends DocuVault's existing storage abstraction with four cloud provider backends. The infrastructure is largely pre-built: the `StorageBackend` ABC with 7 abstract methods already exists (`backend/storage/base.py`), the `cloud_connections` table with all required columns (`id`, `user_id`, `provider`, `credentials_enc`, `status`, `connected_at`) was created in migration 0001, the `documents.storage_backend` column already exists, and `users.default_storage_backend` already exists. No new Alembic migration is needed for the data model.
The three main implementation challenges are: (1) the OAuth2 callback flow where FastAPI owns both the initiation and code-exchange, (2) per-user HKDF credential encryption using the `cryptography` library (which is **not currently in `requirements.txt`** and must be added), and (3) SSRF prevention for user-supplied WebDAV/Nextcloud URLs using Python's built-in `ipaddress` module. Redis is already wired on `app.state.redis` and is the correct choice for OAuth state storage (TTL-backed, eliminates race conditions in multi-instance deployments, already proven pattern in auth.py for TOTP replay prevention).
The WebDAV/Nextcloud backends should use `webdavclient3` wrapped in `asyncio.to_thread()` (matching the MinIOBackend pattern) rather than an async-native library: `webdavclient3` is the most mature option (8+ years old, actively maintained) and its sync API is well-documented. Google Drive uses `google-api-python-client` + `google-auth-oauthlib`; OneDrive uses `msal` with the authorization code flow. Both SDKs are synchronous, so their calls are wrapped in `asyncio.to_thread()` as well.
**Primary recommendation:** Add `cryptography>=41.0.0`, `google-auth-oauthlib>=1.3.1`, `google-api-python-client>=2.196.0`, `msal>=1.36.0`, `webdavclient3>=3.14.7`, and `cachetools>=5.3.0` to `requirements.txt`. Implement OAuth state via Redis TTL (30-minute expiry). Use `cachetools.TTLCache` (version 6.2.6 verified on PyPI) for the 60-second folder listing cache. Use Python's built-in `ipaddress` module for SSRF URL validation; no additional library is needed.
---
## Architectural Responsibility Map
| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| OAuth2 initiation (redirect URL generation) | API / Backend | — | Secrets (client_id, client_secret) must never reach the browser |
| OAuth2 callback code exchange | API / Backend | — | Auth code + client_secret exchange is a server-to-server operation (D-03) |
| OAuth state CSRF validation | API / Backend (Redis) | — | State token must be stored server-side and expire after use (D-04) |
| Credential encryption/decryption | API / Backend | — | HKDF master key lives in env var; decryption happens at API layer only |
| Cloud file upload | API / Backend | Cloud Provider API | Bytes pass through FastAPI intermediary — no direct browser-to-cloud (D-10) |
| Cloud file download/preview | API / Backend | Cloud Provider API | Same proxy endpoint as MinIO (D-15) |
| Cloud folder tree listing | API / Backend | Cloud Provider API | Lazy-load, TTL-cached in FastAPI app state (D-16) |
| SSRF validation | API / Backend | — | Must run before every outbound HTTP call; not frontend-accessible (D-17) |
| Connection status display | Frontend / Client | — | UI reads `status` field from API; no direct cloud calls from browser |
| Cloud Storage settings tab | Frontend / Client | — | New tab in SettingsView; reads/writes via `/api/cloud/connections` |
| On-demand token refresh | API / Backend | — | Transparent to user; handled within the request lifecycle (D-05) |
| Default storage backend selection | API / Backend + DB | Frontend / Client | `users.default_storage_backend` column; UI reads/writes via settings endpoint |
---
## Standard Stack
### Core (new additions to requirements.txt)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `cryptography` | 48.0.0 | HKDF key derivation + Fernet encryption for `credentials_enc` | The only Python library with official HKDF + Fernet in one package; already referenced in CLAUDE.md |
| `google-auth-oauthlib` | 1.3.1 | Google OAuth2 authorization code flow; `Flow` class manages URL generation and code exchange | Official Google library; listed in Google's own Python quickstart |
| `google-api-python-client` | 2.196.0 | Google Drive v3 API (files.get, files.create, files.delete, files.list) | Official Google library; required alongside google-auth-oauthlib for Drive operations |
| `msal` | 1.36.0 | Microsoft Authentication Library — authorization code flow for OneDrive/Microsoft Graph | Official Microsoft library; only sanctioned way to obtain Microsoft Graph tokens |
| `webdavclient3` | 3.14.7 | WebDAV operations (PROPFIND, upload, download, delete) for both Nextcloud and generic WebDAV | Mature (8 years), actively maintained, supports Nextcloud and all standard WebDAV servers |
| `cachetools` | 6.2.6 | `TTLCache` for 60-second folder listing cache in FastAPI app state (D-16) | Standard cache library; pure Python; no new infrastructure dependency |
[VERIFIED: PyPI] All versions confirmed via `pip download` against the PyPI registry.
### Already in requirements.txt (relevant to Phase 5)
| Library | Current Version Spec | Phase 5 Use |
|---------|---------------------|-------------|
| `httpx` | >=0.27 | Microsoft Graph REST calls (aiohttp alternative); already used for HIBP |
| `redis` | >=4.6.0 | OAuth state storage (TTL-keyed state tokens, already on `app.state.redis`) |
| `aioredis` | via `redis[asyncio]` | Already wired in `main.py` lifespan |
| `pydantic` | >=2.0 | Request/response models for new cloud endpoints |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `webdavclient3` | `aiohttp` + raw PROPFIND XML | webdavclient3 handles XML parsing, redirect following, and auth headers; raw aiohttp requires implementing RFC 4918 manually |
| `webdavclient3` | `aiodav` / `aiowebdav2` | These async WebDAV libs are very new (< 2 years old, low download counts); webdavclient3 wrapped in `asyncio.to_thread()` matches the MinIOBackend pattern and is safer |
| `msal` (for OneDrive) | `requests-oauthlib` + raw Graph calls | MSAL handles token refresh, token cache, and `invalid_grant` detection natively |
| `cachetools.TTLCache` | `dict` + timestamp | TTLCache has automatic expiry and LRU eviction; manual dict+timestamp requires cleanup logic; both work, TTLCache is cleaner |
| Redis for OAuth state | Signed JWT state | Redis is already wired; TTL-keyed Redis entries are the proven pattern (auth.py TOTP replay prevention). Signed JWT state is viable but requires HMAC secret management for state-only tokens |
**Installation:**
```bash
# Add to backend/requirements.txt
cryptography>=41.0.0
google-auth-oauthlib>=1.3.1
google-api-python-client>=2.196.0
msal>=1.36.0
webdavclient3>=3.14.7
cachetools>=5.3.0
```
**Version verification:** Confirmed against PyPI via `pip download`:
- `cryptography-48.0.0` — `[VERIFIED: PyPI]`
- `google_auth_oauthlib-1.3.1` — `[VERIFIED: PyPI]`
- `google_api_python_client-2.196.0` — `[VERIFIED: PyPI]`
- `msal-1.36.0` — `[VERIFIED: PyPI]`
- `webdavclient3-3.14.7` — `[VERIFIED: PyPI]`
- `cachetools-6.2.6` — `[VERIFIED: PyPI]`
---
## Package Legitimacy Audit
All packages verified via slopcheck 0.6.1 (run 2026-05-28):
| Package | Registry | Age | Downloads | Source Repo | slopcheck | Disposition |
|---------|----------|-----|-----------|-------------|-----------|-------------|
| `cryptography` | PyPI | 12+ yrs | 100M+/wk | github.com/pyca/cryptography | [OK] | Approved |
| `google-auth-oauthlib` | PyPI | 7+ yrs | 50M+/wk | github.com/googleapis/google-auth-library-python-oauthlib | [OK] | Approved |
| `google-api-python-client` | PyPI | 10+ yrs | 30M+/wk | github.com/googleapis/google-api-python-client | [OK] — note: "Name ends with '-client' — looks like LLM bait but package is established" | Approved |
| `msal` | PyPI | 6+ yrs | 10M+/wk | github.com/AzureAD/microsoft-authentication-library-for-python | [OK] | Approved |
| `webdavclient3` | PyPI | 8+ yrs | 200K+/wk | github.com/CloudPolis/webdavclient3 | [OK] | Approved |
| `cachetools` | PyPI | 10+ yrs | 80M+/wk | github.com/tkem/cachetools | [OK] | Approved |
**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none
---
## Architecture Patterns
### System Architecture Diagram
```
Browser (Vue 3)
│
│ Click "Connect Google Drive"
▼
[GET /api/cloud/oauth/initiate/google_drive]
│ 1. Generate state_token = secrets.token_urlsafe(32)
│ 2. Store Redis: oauth_state:{state_token} = user_id (TTL 30 min)
│ 3. Build authorization_url via google_auth_oauthlib.Flow
│ 4. HTTP 302 redirect → Google OAuth consent page
▼
Google OAuth Consent Page (browser)
│ User approves
│ Google redirects to:
▼
[GET /api/cloud/oauth/callback/google_drive?code=...&state=...]
│ 1. Validate state → lookup Redis oauth_state:{state} → get user_id
│ 2. Delete Redis key (prevent replay)
│ 3. Exchange code → tokens via flow.fetch_token()
│ 4. Serialize credentials (access_token, refresh_token, expiry)
│ 5. Encrypt with HKDF-derived per-user Fernet key
│ 6. Save/upsert cloud_connections row (user_id, provider, credentials_enc, status=ACTIVE)
│ 7. HTTP 302 redirect → Vue /settings?cloud_connected=google_drive
▼
Vue SettingsView (onMounted)
│ Reads ?cloud_connected=google_drive
│ Shows success toast
▼
[GET /api/cloud/connections]
│ Lists all cloud connections for current user
│ Returns CloudConnectionOut (no credentials_enc)
▼
Browser renders Cloud Storage tab with connection status badges
─────── Document Upload to Cloud Folder ───────
Browser (Vue 3)
│ User is viewing Google Drive folder node
│ Drops file
▼
[POST /api/documents/upload]
│ active folder context = cloud folder (provider=google_drive, folder_id=...)
│ 1. Load CloudConnection for user + provider
│ 2. Decrypt credentials_enc → Fernet key → credentials dict
│ 3. Check token expiry → if expired, refresh transparently (D-05)
│ 4. Call google_drive_backend.put_object(user_id, doc_id, bytes, ext, ct)
│ └── asyncio.to_thread → drive.files().create(...)
│ 5. Save Document(storage_backend="google_drive", object_key=drive_file_id)
▼
Browser shows upload progress (same UploadProgress component)
─────── Document Download from Cloud ───────
[GET /api/documents/{id}/content]
│ 1. Load Document → storage_backend = "google_drive"
│ 2. get_storage_backend("google_drive", user_id, session) → GoogleDriveBackend
│ 3. backend.get_object(object_key) → bytes
│ 4. StreamingResponse to browser
▼
Browser renders PDF in existing DocumentPreviewModal
─────── WebDAV/Nextcloud Connection ───────
Browser
│ User submits server_url + username + password (or app password)
▼
[POST /api/cloud/connections/webdav]
│ 1. validate_cloud_url(server_url) → SSRF check (ipaddress module)
│ 2. Test connection: PROPFIND server_url (lightweight)
│ 3. If success: encrypt credentials → save cloud_connections
│ 4. If fail: 422 with error message (D-08)
▼
Browser shows ACTIVE status badge
```
### Recommended Project Structure
```
backend/storage/
├── base.py # existing StorageBackend ABC (7 abstract methods)
├── __init__.py # extend get_storage_backend() factory
├── minio_backend.py # existing reference implementation
├── google_drive_backend.py # new: Google Drive v3
├── onedrive_backend.py # new: Microsoft Graph / OneDrive
├── nextcloud_backend.py # new: Nextcloud (WebDAV + status endpoint)
├── webdav_backend.py # new: generic WebDAV
└── cloud_utils.py # new: validate_cloud_url(), encrypt_credentials(), decrypt_credentials()
backend/api/
└── cloud.py # new: all /api/cloud/* endpoints
backend/services/
└── cloud_cache.py # new: TTLCache singleton for folder listings
backend/tests/
└── test_cloud.py # new: all Phase 5 tests
```
### Pattern 1: StorageBackend ABC Contract (7 methods)
The existing ABC requires all 7 methods. Cloud backends raise `NotImplementedError` for `generate_presigned_put_url` per D-14:
```python
# Source: backend/storage/base.py (verified in codebase)
class StorageBackend(ABC):
    @abstractmethod
    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str: ...

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def delete_object(self, object_key: str) -> None: ...

    @abstractmethod
    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str: ...

    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str: ...

    @abstractmethod
    async def stat_object(self, object_key: str) -> int: ...
```
Cloud backends must define all 7 methods to be instantiable. `generate_presigned_put_url` raises `NotImplementedError` per D-14, and the upload endpoint detects cloud backends and uses the direct upload path instead. `presigned_get_url` also raises `NotImplementedError`, since downloads and previews always go through the proxy endpoint (D-15). `stat_object` returns the file size from the provider's metadata response.
The `object_key` for cloud backends is the **provider's native file ID** (e.g., Google Drive file ID, OneDrive item ID, WebDAV path). The STORE-02 key schema (`{user_id}/{document_id}/{uuid4()}{ext}`) applies only to MinIO.
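A minimal sketch of how a cloud backend might satisfy this contract. The ABC is trimmed to four of the seven methods for brevity, and `ExampleCloudBackend`, its in-memory `files` dict, and `supports_presigned_put` are hypothetical names used only for illustration, not part of the codebase.

```python
import asyncio
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Trimmed stand-in for backend/storage/base.py (4 of the 7 methods)."""

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str: ...

    @abstractmethod
    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str: ...

    @abstractmethod
    async def stat_object(self, object_key: str) -> int: ...


class ExampleCloudBackend(StorageBackend):
    """Hypothetical cloud backend; a dict stands in for the provider API."""

    def __init__(self, files: dict):
        self._files = files  # provider file ID -> file content

    async def get_object(self, object_key: str) -> bytes:
        return self._files[object_key]

    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str:
        # Downloads go through the FastAPI proxy endpoint instead (D-15).
        raise NotImplementedError("cloud backends do not issue presigned GET URLs")

    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str:
        # Uploads use the direct upload path instead (D-14).
        raise NotImplementedError("cloud backends do not issue presigned PUT URLs")

    async def stat_object(self, object_key: str) -> int:
        # A real backend would read the size from provider metadata.
        return len(self._files[object_key])


async def supports_presigned_put(backend: StorageBackend) -> bool:
    """One possible way for an upload endpoint to detect a cloud backend."""
    try:
        await backend.generate_presigned_put_url("probe")
    except NotImplementedError:
        return False
    return True
```

Checking `document.storage_backend != "minio"` would work equally well for detection; probing the method is shown here only because it exercises the `NotImplementedError` contract directly.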
### Pattern 2: HKDF + Fernet Credential Encryption
```python
# Source: cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/
# [VERIFIED: CITED: cryptography.io]
import base64
import json

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def _derive_fernet_key(master_key: bytes, user_id: str) -> Fernet:
    """Derive a per-user Fernet key using HKDF-SHA256.

    master_key = CLOUD_CREDS_KEY env var as bytes
    salt = user_id bytes (deterministic per user, so decrypt derives the same key)
    info = b"cloud-credentials" (domain separation)
    """
    hkdf = HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=user_id.encode("utf-8"),  # deterministic salt = user_id
        info=b"cloud-credentials",
    )
    raw_key = hkdf.derive(master_key)
    fernet_key = base64.urlsafe_b64encode(raw_key)
    return Fernet(fernet_key)


def encrypt_credentials(master_key: bytes, user_id: str, credentials: dict) -> str:
    """Encrypt credentials dict to base64 Fernet token string."""
    f = _derive_fernet_key(master_key, user_id)
    plaintext = json.dumps(credentials).encode("utf-8")
    return f.encrypt(plaintext).decode("utf-8")


def decrypt_credentials(master_key: bytes, user_id: str, credentials_enc: str) -> dict:
    """Decrypt credentials_enc back to dict."""
    f = _derive_fernet_key(master_key, user_id)
    plaintext = f.decrypt(credentials_enc.encode("utf-8"))
    return json.loads(plaintext)
```
**Critical note:** HKDF is **not** reusable — a new `HKDF` instance must be created for each derivation call. The `cryptography` library raises `AlreadyFinalized` if `.derive()` is called twice on the same instance. The `_derive_fernet_key` function must create a fresh `HKDF` instance each call.
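The per-user key separation and the single-use behaviour of `HKDF` can be checked with a short sketch. The master key and user IDs are made-up sample values, and `new_hkdf` / `derive_key` are hypothetical helpers that mirror `_derive_fernet_key`; this assumes the `cryptography` package is installed.

```python
import base64

from cryptography.exceptions import AlreadyFinalized
from cryptography.fernet import Fernet, InvalidToken
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

MASTER_KEY = b"sample-master-key-not-for-production"  # made-up value


def new_hkdf(user_id: str) -> HKDF:
    """Build a fresh HKDF instance; each instance can derive only once."""
    return HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=user_id.encode("utf-8"),
        info=b"cloud-credentials",
    )


def derive_key(user_id: str) -> Fernet:
    raw_key = new_hkdf(user_id).derive(MASTER_KEY)
    return Fernet(base64.urlsafe_b64encode(raw_key))


# Same user ID gives the same key, so an earlier token still decrypts.
token = derive_key("user-a").encrypt(b'{"refresh_token": "abc"}')
roundtrip = derive_key("user-a").decrypt(token)

# A different user ID gives a different key, so decryption is rejected.
try:
    derive_key("user-b").decrypt(token)
    cross_user_rejected = False
except InvalidToken:
    cross_user_rejected = True

# Calling derive() twice on one HKDF instance raises AlreadyFinalized.
hkdf = new_hkdf("user-a")
hkdf.derive(MASTER_KEY)
try:
    hkdf.derive(MASTER_KEY)
    reuse_rejected = False
except AlreadyFinalized:
    reuse_rejected = True
```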
### Pattern 3: Google Drive OAuth2 Flow via google-auth-oauthlib
```python
# Source: googleapis.dev/python/google-auth-oauthlib/latest (VERIFIED: official docs)
from google_auth_oauthlib.flow import Flow

# At initiation:
flow = Flow.from_client_config(
    {
        "web": {
            "client_id": settings.google_client_id,
            "client_secret": settings.google_client_secret,
            "auth_uri": "https://accounts.google.com/o/oauth2/auth",
            "token_uri": "https://oauth2.googleapis.com/token",
        }
    },
    scopes=["https://www.googleapis.com/auth/drive.file"],
)
flow.redirect_uri = f"{settings.backend_url}/api/cloud/oauth/callback/google_drive"
authorization_url, state = flow.authorization_url(access_type="offline", prompt="consent")
# Store state → Redis (key: oauth_state:{state}, value: user_id, TTL 30 min)
# Redirect browser to authorization_url

# At callback:
# Restore flow from client config (stateless: recreate Flow on each callback)
flow = Flow.from_client_config(client_config, scopes=[...], state=state)
flow.redirect_uri = redirect_uri
flow.fetch_token(code=code)
creds = flow.credentials
# creds.token = access token
# creds.refresh_token = refresh token
# creds.expiry = datetime
```
**`access_type="offline"` is required** to obtain a refresh token. Without it, Google only returns a short-lived access token. `prompt="consent"` forces re-consent on each connect, which ensures a fresh refresh token.
### Pattern 4: OneDrive OAuth2 Flow via MSAL
```python
# Source: learn.microsoft.com/en-us/entra/msal/python/ [CITED]
import msal

# MSAL for Python rejects reserved scopes (openid, profile, offline_access)
# with ValueError and requests them itself, so only the resource scope is listed.
SCOPES = ["Files.ReadWrite"]

# Confidential client app (has client_secret)
app = msal.ConfidentialClientApplication(
    client_id=settings.onedrive_client_id,
    client_credential=settings.onedrive_client_secret,
    authority=f"https://login.microsoftonline.com/{settings.onedrive_tenant_id}",
)

# At initiation:
auth_url = app.get_authorization_request_url(
    scopes=SCOPES,
    redirect_uri=f"{settings.backend_url}/api/cloud/oauth/callback/onedrive",
    state=state_token,
)
# Redirect browser to auth_url

# At callback:
result = app.acquire_token_by_authorization_code(
    code=code,
    scopes=SCOPES,
    redirect_uri=redirect_uri,
)
# result["access_token"]: short-lived access token
# result["refresh_token"]: long-lived refresh token
# result["expires_in"]: seconds until access_token expires

# Refresh on-demand (D-05):
result = app.acquire_token_by_refresh_token(
    refresh_token=stored_refresh_token,
    scopes=SCOPES,
)
# If result.get("error") == "invalid_grant" → REQUIRES_REAUTH (D-06)
```
**A refresh token requires the `offline_access` scope** on the Microsoft identity platform. MSAL for Python requests it automatically and raises `ValueError` when a reserved scope (`openid`, `profile`, `offline_access`) is passed explicitly, so `scopes` lists only `Files.ReadWrite`. The `tenant_id` can be `"common"` for multi-tenant apps (personal OneDrive and organizational accounts). For personal OneDrive only, use `"consumers"`.
### Pattern 5: WebDAV Operations via webdavclient3 + asyncio.to_thread
```python
# Source: pypi.org/project/webdavclient3 (VERIFIED: PyPI) [ASSUMED: specific API usage]
import asyncio
import io

from webdav3.client import Client


class WebDAVBackend(StorageBackend):
    # Sketch: only put_object/get_object are shown. The remaining
    # StorageBackend methods must be implemented before instantiation.
    def __init__(self, server_url: str, username: str, password: str):
        options = {
            "webdav_hostname": server_url,
            "webdav_login": username,
            "webdav_password": password,
        }
        self._client = Client(options)
        self._base_path = "docuvault/"  # namespace prefix in WebDAV tree

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        # object_key = WebDAV path used as identifier
        object_key = f"{self._base_path}{user_id}/{document_id}{extension}"
        buf = io.BytesIO(file_bytes)
        # Assumption: the parent collection already exists on the server.
        await asyncio.to_thread(self._client.upload_to, buf, object_key)
        return object_key

    async def get_object(self, object_key: str) -> bytes:
        buf = io.BytesIO()
        await asyncio.to_thread(self._client.download_from, buf, object_key)
        return buf.getvalue()
```
Note: `webdavclient3` is synchronous. All calls MUST be wrapped in `asyncio.to_thread()` — same pattern as `MinIOBackend`. [ASSUMED: `upload_to`/`download_from` method names — verify against installed package docs]
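To illustrate why the wrapping matters, the sketch below uses a hypothetical `SlowSyncClient` (a stand-in for any blocking SDK, not a real library class) and a heartbeat task that can only tick while the event loop is free.

```python
import asyncio
import time


class SlowSyncClient:
    """Hypothetical blocking client standing in for a synchronous SDK."""

    def download(self, path: str) -> bytes:
        time.sleep(0.2)  # simulates blocking network I/O
        return f"contents of {path}".encode()


async def heartbeat(ticks: list) -> None:
    """Record a tick every 50 ms; this only runs when the loop is not blocked."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> tuple:
    client = SlowSyncClient()
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    # The blocking call runs in a worker thread, so heartbeat keeps ticking.
    data = await asyncio.to_thread(client.download, "docuvault/a.pdf")
    ticks_during_download = len(ticks)
    await beat
    return data, ticks_during_download


data, ticks_during_download = asyncio.run(main())
```

Calling `client.download(...)` directly inside `main()` would instead hold the event loop for the full 200 ms, and no heartbeat tick would be recorded until it returned.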
### Pattern 6: SSRF Prevention via ipaddress Module
```python
# Source: python.org/library/ipaddress [VERIFIED: Python stdlib]
import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_NETS = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("169.254.0.0/16"),  # link-local
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("::1/128"),         # IPv6 loopback
    ipaddress.ip_network("fc00::/7"),        # IPv6 ULA
]


def validate_cloud_url(url: str) -> None:
    """Raise ValueError if url targets a private/internal address.

    Called at connect-time and before every WebDAV/Nextcloud request.
    D-17: blocks localhost, 127.x, 169.254.x, RFC 1918 ranges, ::1.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme}")
    hostname = parsed.hostname
    if not hostname:
        raise ValueError("URL has no hostname")
    try:
        # Raw IP literal: nothing to resolve
        candidates = [ipaddress.ip_address(hostname)]
    except ValueError:
        # Not a raw IP: resolve via DNS and check every returned address,
        # because a hostname can publish several A/AAAA records
        try:
            infos = socket.getaddrinfo(hostname, None)
            candidates = [ipaddress.ip_address(info[4][0]) for info in infos]
        except (socket.gaierror, ValueError) as exc:
            raise ValueError(f"Cannot resolve hostname: {exc}") from exc
    for addr in candidates:
        # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply
        mapped = getattr(addr, "ipv4_mapped", None)
        if mapped is not None:
            addr = mapped
        for net in BLOCKED_NETS:
            if addr in net:
                raise ValueError(f"URL targets a private/internal address: {addr}")
```
**Security note:** DNS-based SSRF bypass is a known attack vector — an attacker registers a DNS name that resolves to an internal IP. The `validate_cloud_url` function must resolve DNS and check the resolved IP, not just the hostname string. This pattern is the OWASP-recommended approach. [CITED: cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html]
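One edge case worth testing explicitly is an IPv4-mapped IPv6 literal such as `::ffff:127.0.0.1`. Membership tests across IP versions return `False`, so a mapped address does not match an IPv4 network until it is unwrapped. The sketch below is illustrative only; `is_blocked` and its shortened blocklist are hypothetical and not part of the codebase.

```python
import ipaddress

# Shortened blocklist for the example
BLOCKED = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("::1/128"),
]


def is_blocked(raw: str) -> bool:
    """Return True if the IP literal falls inside any blocked network."""
    addr = ipaddress.ip_address(raw)
    # IPv6Address.ipv4_mapped is the embedded IPv4 address, or None.
    # IPv4Address has no such attribute, hence getattr with a default.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return any(addr in net for net in BLOCKED)


# Without unwrapping, the mapped literal slips past the IPv4 loopback rule.
mapped_literal = ipaddress.ip_address("::ffff:127.0.0.1")
naive_match = mapped_literal in ipaddress.ip_network("127.0.0.0/8")
```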
### Pattern 7: OAuth State Storage via Redis
```python
# Source: established pattern from backend/api/auth.py (VERIFIED: codebase)
# Redis is already on app.state.redis (aioredis client)
import secrets
import uuid

# At OAuth initiation:
state_token = secrets.token_urlsafe(32)
redis_key = f"oauth_state:{state_token}"
await request.app.state.redis.setex(
    redis_key,
    1800,  # 30-minute TTL: long enough for the user to complete OAuth consent
    str(current_user.id),
)
# Return redirect to authorization_url with state=state_token

# At OAuth callback:
redis_key = f"oauth_state:{state}"
user_id_bytes = await request.app.state.redis.get(redis_key)
if not user_id_bytes:
    raise HTTPException(400, "Invalid or expired OAuth state")
await request.app.state.redis.delete(redis_key)  # single-use
user_id = uuid.UUID(user_id_bytes.decode())
```
This follows the exact same pattern as TOTP replay prevention in `auth.py` — Redis TTL key, single-use deletion after validation.
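The issue/consume semantics can be exercised in unit tests without a Redis server. The sketch below is a hypothetical in-memory stand-in (`InMemoryStateStore` is not part of the codebase) with an injectable clock; it isn't suitable for production, since process memory is exactly what the multi-instance argument rules out.

```python
import secrets
import time
from typing import Optional


class InMemoryStateStore:
    """Hypothetical test double mimicking Redis SETEX + GET + DEL."""

    def __init__(self, now=time.monotonic):
        self._now = now  # injectable clock so tests can simulate expiry
        self._entries = {}  # state token -> (expires_at, user_id)

    def issue(self, user_id: str, ttl_seconds: int = 1800) -> str:
        """Create a random state token bound to user_id."""
        token = secrets.token_urlsafe(32)
        self._entries[token] = (self._now() + ttl_seconds, user_id)
        return token

    def consume(self, token: str) -> Optional[str]:
        """Return the user ID once, then forget the token."""
        entry = self._entries.pop(token, None)
        if entry is None:
            return None  # unknown, forged, or already used
        expires_at, user_id = entry
        return user_id if self._now() < expires_at else None
```

Because `consume` removes the entry before checking expiry, a replayed callback with the same `state` value gets `None` on the second attempt.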
### Pattern 8: TTLCache for Folder Listings (cachetools)
```python
# Source: cachetools.readthedocs.io [CITED]
import threading

from cachetools import TTLCache

# In FastAPI lifespan or module-level singleton
# maxsize=1000: enough for ~50 users × 20 folder nodes each
# ttl=60: 60-second cache per D-16
_folder_cache: TTLCache = TTLCache(maxsize=1000, ttl=60)
_folder_cache_lock = threading.Lock()


async def get_cloud_folders_cached(user_id: str, provider: str, folder_id: str, fetch_fn) -> list:
    """Return cached result or call fetch_fn and cache it."""
    cache_key = f"{user_id}:{provider}:{folder_id}"
    with _folder_cache_lock:
        if cache_key in _folder_cache:
            return _folder_cache[cache_key]
    result = await fetch_fn()  # async call, made outside the lock
    with _folder_cache_lock:
        _folder_cache[cache_key] = result
    return result
```
**Thread safety:** `cachetools.TTLCache` is not thread-safe by itself. A `threading.Lock` is required for concurrent access. The fetch function itself is async and must be called outside the lock to avoid blocking the event loop. [CITED: cachetools.readthedocs.io — "access to a shared cache from multiple threads must be properly synchronized"]
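For comparison, the `dict` + timestamp alternative listed under Alternatives Considered could look roughly like the sketch below. It is not the recommended option: `get_folders_cached`, `fake_fetch`, and the injectable `now` clock are hypothetical names, and expired entries are only replaced on the next lookup of the same key, which is the cleanup gap that `TTLCache` closes.

```python
import asyncio
import time

TTL_SECONDS = 60
_cache = {}  # cache key -> (expires_at, folder listing)


async def get_folders_cached(user_id, provider, folder_id, fetch_fn, now=time.monotonic):
    """Return a cached listing if still fresh, otherwise fetch and store it."""
    key = f"{user_id}:{provider}:{folder_id}"
    hit = _cache.get(key)
    if hit is not None and now() < hit[0]:
        return hit[1]
    result = await fetch_fn()
    _cache[key] = (now() + TTL_SECONDS, result)
    return result


calls = []  # records each real fetch so the demo can count them


async def fake_fetch():
    """Hypothetical provider call."""
    calls.append(1)
    return ["Invoices", "Receipts"]


async def demo():
    clock = [0.0]

    def tick():
        return clock[0]

    first = await get_folders_cached("u1", "google_drive", "root", fake_fetch, now=tick)
    second = await get_folders_cached("u1", "google_drive", "root", fake_fetch, now=tick)
    clock[0] = 61.0  # move past the 60-second TTL
    third = await get_folders_cached("u1", "google_drive", "root", fake_fetch, now=tick)
    return first, second, third


first, second, third = asyncio.run(demo())
```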
### Pattern 9: Factory Extension (get_storage_backend)
```python
# Source: backend/storage/__init__.py (VERIFIED: codebase)
# Current factory only returns MinIOBackend. Phase 5 extends it:
async def get_storage_backend_for_document(
    document: Document,
    user: User,
    session: AsyncSession,
) -> StorageBackend:
    """Return the correct StorageBackend for the given document.

    MinIO documents (storage_backend='minio'): return shared MinIOBackend.
    Cloud documents: load CloudConnection, decrypt credentials, return backend instance.
    """
    if document.storage_backend == "minio":
        return get_storage_backend()  # existing factory

    # Load cloud connection
    result = await session.execute(
        select(CloudConnection).where(
            CloudConnection.user_id == user.id,
            CloudConnection.provider == document.storage_backend,
            CloudConnection.status == "ACTIVE",
        )
    )
    conn = result.scalar_one_or_none()
    if conn is None:
        raise HTTPException(503, "Cloud connection not found or inactive")

    master_key = settings.cloud_creds_key.encode()
    credentials = decrypt_credentials(master_key, str(user.id), conn.credentials_enc)

    if document.storage_backend == "google_drive":
        return GoogleDriveBackend(credentials)
    elif document.storage_backend == "onedrive":
        return OneDriveBackend(credentials)
    elif document.storage_backend in ("nextcloud", "webdav"):
        return WebDAVBackend(credentials["server_url"], credentials["username"], credentials["password"])
    else:
        raise ValueError(f"Unknown storage backend: {document.storage_backend}")
```
### Pattern 10: On-Demand Token Refresh (D-05)
```python
# Source: D-05 decision (CONTEXT.md) [ASSUMED: exact error class names]
class GoogleDriveBackend(StorageBackend):
    async def _call_with_refresh(self, operation_fn, credentials: dict, user_id: str, conn: CloudConnection, session):
        """Attempt operation; on 401, refresh tokens and retry once."""
        try:
            return await operation_fn(credentials)
        except Exception as e:
            # Google Drive: googleapiclient.errors.HttpError with status 401
            if _is_token_expired_error(e):
                new_creds = await self._refresh_token(credentials)
                if new_creds is None:
                    # invalid_grant: set REQUIRES_REAUTH (D-06)
                    conn.status = "REQUIRES_REAUTH"
                    await session.commit()
                    raise CloudConnectionError("Cloud connection requires re-authentication")
                # Update credentials_enc
                master_key = settings.cloud_creds_key.encode()
                conn.credentials_enc = encrypt_credentials(master_key, user_id, new_creds)
                conn.status = "ACTIVE"
                await session.commit()
                return await operation_fn(new_creds)
            raise
```
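The control flow of D-05 and D-06 can be tested in isolation from any provider SDK. The sketch below is a provider-agnostic simplification: `TokenExpired`, `InvalidGrant`, `ReauthRequired`, `call_with_refresh`, and the plain `state` dict are hypothetical stand-ins for the real error classes, backend method, and `CloudConnection` row.

```python
import asyncio


class TokenExpired(Exception):
    """Hypothetical marker for an HTTP 401 from the provider."""


class InvalidGrant(Exception):
    """Hypothetical marker for a refresh token the provider rejected."""


class ReauthRequired(Exception):
    """Raised to tell the user to reconnect the provider (D-06)."""


async def call_with_refresh(operation, refresh, state: dict):
    """Run operation; on token expiry, refresh once and retry once."""
    try:
        return await operation(state["access_token"])
    except TokenExpired:
        try:
            state["access_token"] = await refresh(state["refresh_token"])
        except InvalidGrant:
            state["status"] = "REQUIRES_REAUTH"
            raise ReauthRequired("reconnect the cloud provider") from None
        state["status"] = "ACTIVE"
        # A second TokenExpired propagates: there is no retry loop.
        return await operation(state["access_token"])


async def list_files(token: str) -> list:
    if token != "fresh":
        raise TokenExpired()
    return ["a.pdf"]


async def good_refresh(refresh_token: str) -> str:
    return "fresh"


async def revoked_refresh(refresh_token: str) -> str:
    raise InvalidGrant()


ok_state = {"access_token": "stale", "refresh_token": "r1", "status": "ACTIVE"}
ok_result = asyncio.run(call_with_refresh(list_files, good_refresh, ok_state))

bad_state = {"access_token": "stale", "refresh_token": "r2", "status": "ACTIVE"}
try:
    asyncio.run(call_with_refresh(list_files, revoked_refresh, bad_state))
    reauth_raised = False
except ReauthRequired:
    reauth_raised = True
```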
### Anti-Patterns to Avoid
- **Storing OAuth state in FastAPI process memory:** Multi-instance deployments will fail because the callback may arrive at a different instance than the one that created the state. Use Redis.
- **Reusing the HKDF instance:** The `cryptography` library raises `AlreadyFinalized` on second call to `.derive()`. Always create a new `HKDF` instance per key derivation.
- **Checking hostname string for SSRF, not resolved IP:** `validate_cloud_url("http://internal.corp")` would pass a string check but may resolve to `10.0.0.1`. Always resolve DNS and check the resulting IP.
- **Returning `credentials_enc` in any API response:** The `CloudConnectionOut` Pydantic model (already in `admin.py`) is the whitelist — use it for all cloud connection responses.
- **Calling cloud SDK methods from the async event loop without `asyncio.to_thread()`:** All cloud SDKs (`google-api-python-client`, `msal`, `webdavclient3`) are synchronous. Blocking the event loop kills throughput.
- **Using `prompt="consent"` only on first connect:** Without `prompt="consent"`, Google may not return a refresh token on reconnect if the app was previously authorized. Always pass `prompt="consent"` to guarantee a fresh refresh token.
- **Single cloud_connections row per user:** The schema supports multiple providers simultaneously (one row per provider per user, D-13). The upsert logic must match on `(user_id, provider)` not just `user_id`.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| OAuth2 PKCE + token exchange for Google | Custom HMAC/base64 code verifier | `google_auth_oauthlib.flow.Flow` | Handles RFC 7636 PKCE, redirect URI validation, and token serialization |
| OAuth2 for Microsoft Graph | Raw `requests` calls to login.microsoftonline.com | `msal.ConfidentialClientApplication` | MSAL handles token cache, `invalid_grant` detection, tenant routing, and PKCE |
| WebDAV PROPFIND XML | Raw `httpx` with hand-coded XML bodies | `webdavclient3.Client` | PROPFIND response parsing, multistatus handling, redirect following |
| Fernet encryption + key derivation | AES-GCM + custom key stretching | `cryptography` Fernet + HKDF | Fernet is misuse-resistant (authenticated encryption with IV, HMAC tag) — hand-rolled AES can fail silently |
| Private IP detection for SSRF | Regex on URL string | `ipaddress.ip_network().supernet_of()` | Python's `ipaddress` module handles IPv4/IPv6 edge cases including `::ffff:127.0.0.1` mapped addresses |
| In-memory TTL cache | `dict` with `asyncio.get_event_loop().time()` comparison | `cachetools.TTLCache` | TTLCache provides LRU eviction and correct TTL semantics; it is not thread-safe, so guard access from multiple threads with a lock |
| OAuth state token validation | JWT with custom HMAC | Redis TTL key | Redis TTL provides natural expiry + single-use deletion; no new secret required |
**Key insight:** All cloud credential handling is a solved problem at the library level. The most common Phase 5 failure mode would be re-implementing OAuth token exchange by hand, where edge cases around redirect URI matching, PKCE, and token format break silently.
---
## Common Pitfalls
### Pitfall 1: Google Refresh Token Only Issued Once
**What goes wrong:** User connects Google Drive; the first connection includes a refresh token. Later the user disconnects and reconnects. Google does not issue a new refresh token because the user already authorized the app — the re-authorization returns only an access token. Credentials are stored but the connection goes stale in 1 hour.
**Why it happens:** Google only issues a refresh token on the first authorization for a given client_id + user pair, or when `prompt="consent"` is explicitly passed.
**How to avoid:** Always pass `prompt="consent"` and `access_type="offline"` in `flow.authorization_url()`.
**Warning signs:** `credentials.refresh_token` is `None` after `flow.fetch_token()`.
### Pitfall 2: webdavclient3 Path Encoding for Nextcloud
**What goes wrong:** Nextcloud returns 404 or 207 Multi-Status with an empty propfind result for paths with spaces or non-ASCII characters when the path is not percent-encoded.
**Why it happens:** Nextcloud's WebDAV endpoint requires percent-encoded paths; webdavclient3 may or may not encode paths depending on the method called.
**How to avoid:** Use `urllib.parse.quote()` on all path segments before passing to webdavclient3 operations that accept raw paths. [ASSUMED — verify against webdavclient3 docs during implementation]
**Warning signs:** Works with ASCII-only filenames; fails with spaces or umlauts.
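A minimal sketch of the segment-wise encoding described above, using only the standard library. The helper name `encode_dav_path` is hypothetical. Whether webdavclient3 already encodes a given path is still the [ASSUMED] item, so check that first: applying this on top of a library that encodes internally would double-encode the path.

```python
from urllib.parse import quote


def encode_dav_path(path: str) -> str:
    """Percent-encode each segment of a WebDAV path, keeping '/' separators."""
    # safe="" makes quote() encode everything except unreserved characters,
    # so spaces become %20 and non-ASCII characters are UTF-8 percent-encoded.
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```

For example, `encode_dav_path("Documents/Jahresübersicht 2026.pdf")` returns `"Documents/Jahres%C3%BCbersicht%202026.pdf"`.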
### Pitfall 3: HKDF AlreadyFinalized Error
**What goes wrong:** `cryptography.exceptions.AlreadyFinalized` is raised when `HKDF.derive()` is called a second time on the same instance.
**Why it happens:** HKDF is a one-shot operation by design in the `cryptography` library.
**How to avoid:** Create a new `HKDF(...)` instance inside `_derive_fernet_key()` on every call — never store or reuse the HKDF instance.
**Warning signs:** Works in unit tests (each test creates a fresh instance), fails under concurrent load or in repeated calls within the same request.
### Pitfall 4: OAuth Callback State Mismatch in Multi-Instance Deployment
**What goes wrong:** State token is stored in a Python dict in-process. The OAuth callback arrives at a different uvicorn instance → `invalid state` error.
**Why it happens:** HTTP requests are not session-sticky in a load-balanced deployment.
**How to avoid:** Store OAuth state in Redis (`app.state.redis`) with a 30-minute TTL. [VERIFIED: Redis already wired in codebase at `app.state.redis`]
**Warning signs:** OAuth works in single-instance Docker Compose but fails intermittently in production.
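The single-use, TTL-bound state store could look roughly like the sketch below. It assumes the client at `app.state.redis` exposes async `set(name, value, ex=...)` and `getdel(name)` as `redis.asyncio.Redis` does; `GETDEL` needs Redis 6.2 or newer, so confirm the deployed version. The key prefix `oauth_state:`, the function names, and the `InMemoryKV` stand-in (used only to exercise the logic without Redis) are all hypothetical.

```python
import secrets
from typing import Optional

STATE_TTL_SECONDS = 30 * 60  # 30-minute TTL, as suggested above


class InMemoryKV:
    """Test stand-in exposing the two calls used below; it ignores the TTL."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def set(self, name: str, value: str, ex: int) -> None:
        self._data[name] = value

    async def getdel(self, name: str) -> Optional[str]:
        return self._data.pop(name, None)


async def create_oauth_state(kv, user_id: str, provider: str) -> str:
    """Store a random state token bound to the initiating user and provider."""
    state = secrets.token_urlsafe(32)
    await kv.set(f"oauth_state:{state}", f"{provider}:{user_id}", ex=STATE_TTL_SECONDS)
    return state


async def consume_oauth_state(kv, state: str, provider: str) -> Optional[str]:
    """Return the user ID for a valid state, deleting it so it cannot be replayed."""
    raw = await kv.getdel(f"oauth_state:{state}")
    if raw is None:
        return None  # unknown, expired, or already used
    if isinstance(raw, bytes):  # real Redis clients may return bytes
        raw = raw.decode()
    stored_provider, _, user_id = raw.partition(":")
    return user_id if stored_provider == provider else None
```

Because the lookup and the delete happen in one call, two callbacks racing with the same state cannot both succeed.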
### Pitfall 5: DNS Rebinding Attack on SSRF Validation
**What goes wrong:** `validate_cloud_url` resolves `attacker.com` to `8.8.8.8` (passes validation), then the subsequent request resolves `attacker.com` to `169.254.169.254` (cloud metadata endpoint). The validation and the actual request see different IPs.
**Why it happens:** DNS TTL expires between validation and request; attacker controls the DNS.
**How to avoid:** Use `socket.create_connection` with the pre-validated IP directly (pin the IP), or document that a network-level egress firewall is the defense-in-depth layer for DNS rebinding. The `validate_cloud_url` utility call immediately before each request (not once at connect time) reduces the window. [CITED: cheatsheetseries.owasp.org]
**Warning signs:** SSRF test passes with direct IP inputs but might miss DNS-based attacks.
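One way to pin the IP is to resolve once, validate every returned address, and hand the validated addresses to the connection code. The sketch below uses only the standard library; `resolve_and_validate` is a hypothetical helper, not the existing `validate_cloud_url`, and it treats anything that is not globally routable as blocked. Connecting to a pinned IP over HTTPS still requires sending the original hostname for SNI and the `Host` header, which this sketch does not cover.

```python
import ipaddress
import socket


def resolve_and_validate(host: str, port: int) -> list[str]:
    """Resolve host once, reject non-public addresses, return the IPs to connect to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
        if ip.version == 6 and ip.ipv4_mapped is not None:
            ip = ip.ipv4_mapped
        # is_global is False for loopback, RFC 1918, link-local, and similar ranges.
        if not ip.is_global:
            raise ValueError(f"{host} resolves to non-public address {ip}")
        addresses.append(str(ip))
    return addresses
```

The caller would then open the socket against one of the returned addresses rather than resolving the hostname a second time.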
### Pitfall 6: Microsoft Graph Upload Size Limit
**What goes wrong:** Files larger than 4 MB fail with `413 Request Entity Too Large` when uploaded via a single PUT/POST to Microsoft Graph.
**Why it happens:** Microsoft Graph's simple upload endpoint is limited to 4 MB. Larger files require a resumable upload session (`createUploadSession`).
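**How to avoid:** Upload through a resumable session: call `POST /me/drive/root:/{path}:/createUploadSession` to get an upload URL, then send the file in 10 MiB chunks (chunk sizes must be a multiple of 320 KiB; 10 MiB is 32 × 320 KiB). Phase 5 uses upload sessions for all file sizes instead of gating on 4 MB (see Open Question 3).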
**Warning signs:** Tests with small files pass; production uploads of real documents (> 4 MB) fail silently or with 413.
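The chunking arithmetic for an upload session can be sketched as below. This only computes the byte ranges and `Content-Range` header values; the HTTP calls themselves are omitted. The function name `content_ranges` and the constant `GRAPH_CHUNK_ALIGNMENT` are hypothetical, and the 320 KiB alignment rule should be re-checked against the current Microsoft Graph documentation before relying on it.

```python
from typing import Iterator

GRAPH_CHUNK_ALIGNMENT = 320 * 1024  # chunk sizes must be a multiple of 320 KiB
CHUNK_SIZE = 10 * 1024 * 1024       # 10 MiB = 32 * 320 KiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> Iterator[tuple[int, int, str]]:
    """Yield (start, end_inclusive, Content-Range value) for each chunk of a file."""
    if chunk_size % GRAPH_CHUNK_ALIGNMENT != 0:
        raise ValueError("chunk_size must be a multiple of 320 KiB")
    start = 0
    while start < total_size:
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, total_size) - 1
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1
```

For a 25 MiB file this yields three ranges, the first with the header value `bytes 0-10485759/26214400`.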
### Pitfall 7: Google Drive files() Service Is Synchronous
**What goes wrong:** `googleapiclient.discovery.build()` and all `service.files().xxx().execute()` calls are synchronous and block the event loop.
**Why it happens:** `google-api-python-client` was built before asyncio was standard.
**How to avoid:** Wrap every SDK call in `asyncio.to_thread()`. Do NOT await `service.files().list()` directly — it is not a coroutine.
**Warning signs:** FastAPI request handler completes quickly in tests but blocks under load.
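The wrapping pattern can be illustrated without the Google SDK. In the sketch below, `blocking_list_files` is a hypothetical stand-in for a synchronous call such as `service.files().list().execute()`; only `asyncio.to_thread()` is the real mechanism being shown.

```python
import asyncio
import time


def blocking_list_files() -> list[str]:
    """Stand-in for a synchronous SDK call that blocks on network I/O."""
    time.sleep(0.05)
    return ["a.pdf", "b.pdf"]


async def list_files() -> list[str]:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_list_files)


async def main() -> list[str]:
    # Both calls run in separate threads and overlap instead of serializing.
    first, second = await asyncio.gather(list_files(), list_files())
    return first + second
```

Calling `blocking_list_files()` directly inside a coroutine would stall every other request on the same event loop for the duration of the call.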
---
## Code Examples
### Credential Round-Trip Test (CLOUD-02)
```python
# Source: based on cryptography.io HKDF docs [CITED: cryptography.io]
import base64
import json
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.fernet import Fernet


def test_credential_encryption_round_trip():
    master_key = b"test-master-key-32bytes-padded!!"  # 32 bytes
    user_id = "550e8400-e29b-41d4-a716-446655440000"
    credentials = {"access_token": "ya29.xxx", "refresh_token": "1//xxx", "expiry": "2026-05-28T15:00:00"}
    encrypted = encrypt_credentials(master_key, user_id, credentials)
    assert isinstance(encrypted, str)
    assert "access_token" not in encrypted  # not plaintext
    # Decrypt the ciphertext (not the original dict) to prove the round trip
    decrypted = decrypt_credentials(master_key, user_id, encrypted)
    assert decrypted == credentials
```
### SSRF Validation Test
```python
# Source: pattern derived from OWASP SSRF cheat sheet [CITED: cheatsheetseries.owasp.org]
import pytest


@pytest.mark.parametrize("url,should_raise", [
    ("http://localhost/dav", True),
    ("http://127.0.0.1/dav", True),
    ("http://169.254.169.254/dav", True),
    ("http://10.0.0.1/dav", True),
    ("http://192.168.1.1/dav", True),
    ("http://172.16.0.1/dav", True),
    ("https://nextcloud.example.com/remote.php/dav", False),
    ("http://[::1]/dav", True),  # IPv6 literals must be bracketed in URLs
])
def test_ssrf_validation(url, should_raise):
    if should_raise:
        with pytest.raises(ValueError):
            validate_cloud_url(url)
    else:
        validate_cloud_url(url)  # no exception
```
### CloudConnectionOut Whitelist Enforcement
```python
# Source: backend/api/admin.py (VERIFIED: codebase)
# The CloudConnectionOut model already exists in admin.py.
# ALL cloud connection endpoints must use this model, not CloudConnection ORM directly.
class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    model_config = {"from_attributes": True}


# Usage in cloud.py:
@router.get("/api/cloud/connections")
async def list_connections(
    current_user: User = Depends(get_regular_user),
    session: AsyncSession = Depends(get_db),
) -> dict:
    result = await session.execute(
        select(CloudConnection).where(CloudConnection.user_id == current_user.id)
    )
    connections = result.scalars().all()
    return {"items": [CloudConnectionOut.model_validate(c).model_dump() for c in connections]}
```
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Storing OAuth state in Flask/FastAPI session (in-memory) | Redis TTL-keyed state tokens | ~2022 with multi-instance deployments becoming standard | Multi-instance safety; prevents token fixation |
| webdav-client-python (original) | webdavclient3 (fork, actively maintained) | 2018 | webdav-client-python is unmaintained; webdavclient3 is the maintained fork |
| `google.oauth2.service_account.Credentials` (service accounts) | `google-auth-oauthlib` Flow for user-delegated access | 2019 | Service accounts need Google Workspace (formerly G Suite) domain-wide delegation to act on a user's behalf; user OAuth is required for personal Drive |
| ADAL (Azure Active Directory Authentication Library) for Python | MSAL (Microsoft Authentication Library) | 2020; ADAL deprecated | ADAL end-of-life June 2023; MSAL is the replacement |
| Using `Fernet.generate_key()` with user passwords | HKDF + Fernet (key derivation before Fernet) | Ongoing best practice | A Fernet key is 32 bytes, URL-safe base64-encoded; `generate_key()` returns a fresh random key, not a deterministic per-user key |
**Deprecated/outdated:**
- `adal` Python package: End-of-life; replaced by `msal`. Do NOT use.
- `webdav-client-python` (without the `3`): Unmaintained since ~2018. Use `webdavclient3`.
- `google.oauth2.service_account.Credentials`: For service accounts, not user-delegated Drive access. Wrong tool for this use case.
---
## Assumptions Log
| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | `webdavclient3` uses `upload_to` / `download_from` method names for stream-based operations | Architecture Patterns Pattern 5 | Planner must verify method signatures against installed package; wrong method names cause `AttributeError` at test time |
| A2 | Google Drive `googleapiclient.errors.HttpError` status 401 is the token-expiry signal | Pattern 10: On-Demand Token Refresh | Actual exception class may differ; must verify during implementation with a real expired token |
| A3 | Microsoft Graph `invalid_grant` error appears in `result["error"]` from `msal.acquire_token_by_refresh_token` | Pattern 10 | MSAL may use a different error field or raise an exception; verify against msal docs |
| A4 | `webdavclient3` percent-encodes paths automatically | Pitfall 2 | May require manual encoding; verify during WebDAV backend implementation |
| A5 | `tenant_id="common"` works for both personal OneDrive and organizational accounts | Pattern 4: MSAL | May require `"consumers"` for personal accounts; verify against Microsoft docs for the target use case |
---
## Open Questions (RESOLVED)
1. **Google Drive object key scheme for `stat_object`**
- What we know: MinIO `stat_object` returns size in bytes from the storage layer. Google Drive returns file metadata including `size` from `files.get(fileId, fields='size')`.
- What's unclear: Google Drive may not return `size` for Google Workspace files (Docs, Sheets, Slides) since they have no binary size. DocuVault uploads binary files, so this may not be an issue in practice.
- Recommendation: Implement `stat_object` using `service.files().get(fileId=object_key, fields="size").execute()` and return `int(metadata["size"])`. Add a fallback of `0` for files without a size.
- **RESOLVED:** Use `service.files().get(fileId=object_key, fields="size").execute()` and return `int(metadata.get("size", 0))`. DocuVault only uploads binary files so the `0` fallback handles edge cases without breaking functionality.
2. **Nextcloud folder listing path convention**
- What we know: Nextcloud WebDAV base path is typically `/remote.php/dav/files/{username}/`.
- What's unclear: Whether the `webdavclient3` `Client` automatically handles the `/remote.php/dav/files/{username}/` prefix or whether it must be included in the `server_url`.
- Recommendation: Store `server_url` as the full WebDAV root (e.g., `https://nc.example.com/remote.php/dav/files/alice/`) and use relative paths within it. Test with PROPFIND on the root to validate the connection (D-08).
- **RESOLVED:** `server_url` stores the full WebDAV root including the `/remote.php/dav/files/{username}/` prefix. All relative paths within WebDAVBackend and NextcloudBackend are appended to this base. Connection validation uses a PROPFIND on the root path per D-08.
3. **Microsoft Graph upload for files > 4 MB**
- What we know: Simple upload (PUT `/me/drive/root:/{path}:/content`) is limited to 4 MB. Resumable sessions handle larger files.
- What's unclear: The Phase 5 plan should specify whether to implement resumable sessions upfront or use a 4 MB size gate.
- Recommendation: Implement resumable upload session (`createUploadSession`) for all files to avoid the hard limit. It handles both small and large files without a size check.
- **RESOLVED:** Implement `createUploadSession` for ALL file sizes (no size gate). `CHUNK_SIZE = 10 * 1024 * 1024` (10 MiB, a multiple of the 320 KiB chunk granularity Graph requires) is used in all OneDrive uploads. Pitfall 6 is documented in the Common Pitfalls section.
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python 3.12 (Docker) | All backends | In Docker container | 3.12.x | — |
| Redis | OAuth state storage | In Docker Compose | 6.x+ | — |
| PostgreSQL | cloud_connections table | In Docker Compose | 15.x | — |
| `cryptography` package | Credential encryption | NOT in requirements.txt | — | Must be added (48.0.0 verified) |
| `google-auth-oauthlib` | Google Drive OAuth | NOT in requirements.txt | — | Must be added (1.3.1 verified) |
| `google-api-python-client` | Google Drive API | NOT in requirements.txt | — | Must be added (2.196.0 verified) |
| `msal` | OneDrive OAuth | NOT in requirements.txt | — | Must be added (1.36.0 verified) |
| `webdavclient3` | WebDAV/Nextcloud | NOT in requirements.txt | — | Must be added (3.14.7 verified) |
| `cachetools` | Folder listing cache | NOT in requirements.txt | — | Must be added (6.2.6 verified) |
| Google OAuth App (GCP console) | Google Drive integration | NOT CONFIGURED | — | Must be created by user; client_id/client_secret added to .env |
| Microsoft App Registration (Azure portal) | OneDrive integration | NOT CONFIGURED | — | Must be created by user; client_id/client_secret/tenant_id added to .env |
**Missing dependencies with no fallback:**
- `cryptography`, `google-auth-oauthlib`, `google-api-python-client`, `msal`, `webdavclient3`, `cachetools` — must be added to `requirements.txt` before any cloud backend code runs.
**Missing dependencies with fallback (soft):**
- Google OAuth App credentials: Integration tests for Google Drive will need mocked OAuth flows if real GCP app is not configured. Unit tests can mock the entire OAuth flow.
- Microsoft App Registration: Same as above for OneDrive.
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already in requirements.txt) |
| Config file | `backend/pytest.ini` (already exists) |
| Quick run command | `cd backend && pytest tests/test_cloud.py -x -v` |
| Full suite command | `cd backend && pytest -v` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| CLOUD-01 | User can connect all 4 providers | Integration | `pytest tests/test_cloud.py::test_connect_google_drive -x` | ❌ Wave 0 |
| CLOUD-01 | OAuth callback validates state and saves connection | Integration | `pytest tests/test_cloud.py::test_oauth_callback_valid_state -x` | ❌ Wave 0 |
| CLOUD-01 | Invalid OAuth state returns 400 | Integration | `pytest tests/test_cloud.py::test_oauth_callback_invalid_state -x` | ❌ Wave 0 |
| CLOUD-01 | WebDAV/Nextcloud connection validated before save (D-08) | Integration | `pytest tests/test_cloud.py::test_webdav_connect_validates -x` | ❌ Wave 0 |
| CLOUD-02 | Credential encryption/decryption round-trip | Unit | `pytest tests/test_cloud.py::test_credential_round_trip -x` | ❌ Wave 0 |
| CLOUD-02 | `credentials_enc` not in any API response (SEC-08) | Integration | `pytest tests/test_cloud.py::test_credentials_enc_not_exposed -x` | ❌ Wave 0 |
| CLOUD-03 | Upload to cloud folder goes through FastAPI (not presigned URL) | Integration | `pytest tests/test_cloud.py::test_cloud_upload_no_presigned -x` | ❌ Wave 0 |
| CLOUD-04 | Connection status displayed correctly | Integration | `pytest tests/test_cloud.py::test_connection_status_display -x` | ❌ Wave 0 |
| CLOUD-05 | `invalid_grant` → `REQUIRES_REAUTH` transition | Integration | `pytest tests/test_cloud.py::test_invalid_grant_sets_requires_reauth -x` | ❌ Wave 0 |
| CLOUD-06 | Disconnect permanently deletes credentials | Integration | `pytest tests/test_cloud.py::test_disconnect_deletes_credentials -x` | ❌ Wave 0 |
| CLOUD-07 | StorageBackend factory returns correct type | Unit | `pytest tests/test_cloud.py::test_factory_returns_correct_backend -x` | ❌ Wave 0 |
| D-17 | SSRF validation blocks RFC-1918 and loopback | Unit | `pytest tests/test_cloud.py::test_ssrf_validation -x` | ❌ Wave 0 |
| D-17 | SSRF validation blocks 169.254.x link-local | Unit | `pytest tests/test_cloud.py::test_ssrf_link_local -x` | ❌ Wave 0 |
| SEC | Admin cannot access cloud connection credentials | Integration | `pytest tests/test_cloud.py::test_admin_cannot_see_credentials -x` | ❌ Wave 0 |
| SEC | Cross-user cloud connection access returns 404 | Integration | `pytest tests/test_cloud.py::test_cross_user_idor -x` | ❌ Wave 0 |
### Sampling Rate
- **Per task commit:** `cd backend && pytest tests/test_cloud.py -x -v`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `backend/tests/test_cloud.py` — all Phase 5 tests (unit + integration), starting with xfail stubs
- [ ] New conftest fixtures: `mock_google_drive_creds`, `mock_onedrive_creds`, `mock_webdav_client`, `cloud_connection_factory`
---
## Security Domain
### Applicable ASVS Categories
| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | yes | OAuth2 state CSRF; per-session token; `get_regular_user` dep on all cloud endpoints |
| V3 Session Management | yes | OAuth state token is single-use; stored in Redis with TTL; deleted after callback |
| V4 Access Control | yes | Every `/api/cloud/*` endpoint asserts `connection.user_id == current_user.id` before operations |
| V5 Input Validation | yes | `validate_cloud_url()` for WebDAV/Nextcloud; Pydantic models for all request bodies; no raw string interpolation in URLs |
| V6 Cryptography | yes | HKDF + Fernet for credential encryption; Fernet uses AES-128-CBC with HMAC-SHA256 via the `cryptography` library (never hand-rolled) |
| V7 Error Handling | yes | `invalid_grant` handled explicitly (D-06); no stack traces in cloud API error responses |
### Known Threat Patterns for OAuth + Cloud Storage
| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| CSRF on OAuth callback | Tampering | `state` parameter validated via Redis; state token is `secrets.token_urlsafe(32)` |
| SSRF via WebDAV/Nextcloud URL | Tampering / Information Disclosure | `validate_cloud_url()` at connect-time and before each request; `ipaddress` module DNS resolution check |
| Credential exposure via API leak | Information Disclosure | `CloudConnectionOut` Pydantic whitelist; `credentials_enc` excluded by omission |
| Token replay via OAuth state | Elevation of Privilege | Redis single-use deletion after callback; 30-minute TTL prevents stale states |
| Cross-user cloud connection access | IDOR | `connection.user_id == current_user.id` assertion on every operation; 404 not 403 |
| Unverified credentials stored (D-08) | Information Disclosure / DoS | PROPFIND/OPTIONS validation before storage; error returned on failure |
| Refresh token theft from DB | Information Disclosure | `credentials_enc` is Fernet-encrypted with HKDF per-user key; master key in env var only |
| Admin accessing user cloud credentials | Broken Access Control | `get_regular_user` dep blocks admin (403); `CloudConnectionOut` whitelist on all responses |
| DNS rebinding SSRF bypass | Tampering | `validate_cloud_url()` called immediately before each outbound request (not only at connect-time); documented defense-in-depth via network egress firewall |
---
## Project Constraints (from CLAUDE.md)
The following CLAUDE.md directives are binding for Phase 5:
- JWT access token lives in Pinia memory only — never localStorage or sessionStorage (OAuth callback must redirect to Vue with a query param, not embed tokens in the URL)
- Cloud credentials encrypted with HKDF per-user key derivation — master key in env var only
- Admin endpoints never return `credentials_enc`
- Every cloud connection endpoint asserts `resource.user_id == current_user.id`
- All DB queries via ORM / parameterized statements — zero raw string interpolation
- `get_regular_user` on all cloud connection endpoints (admin blocked from this surface)
- `write_audit_log()` called on cloud connect, disconnect, and re-auth events
- Testing protocol: every new function, endpoint, and component must have at least one test; `pytest -v` must pass zero failures
- Security gate: `bandit -r backend/`, `pip-audit`, `npm audit --audit-level=high` must all pass before phase advancement
- Bug fix rule: root cause only, ≤50 lines, regression test required
---
## Sources
### Primary (HIGH confidence)
- `backend/storage/base.py` — StorageBackend ABC, 7 abstract methods, exact signatures
- `backend/storage/minio_backend.py` — asyncio.to_thread() wrapping pattern, error handling shape
- `backend/storage/__init__.py` — factory pattern to extend
- `backend/db/models.py` — CloudConnection model fields, Document.storage_backend, User.default_storage_backend
- `backend/api/admin.py` — CloudConnectionOut Pydantic whitelist pattern (already exists)
- `backend/main.py` — Redis wiring on app.state.redis, lifespan pattern
- `backend/deps/auth.py` — get_regular_user, get_current_user patterns
- `backend/migrations/versions/0001_initial_schema.py` — confirmed cloud_connections table, storage_backend columns
- [cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/](https://cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/) — HKDF usage and info parameter
- [cryptography.io/en/latest/fernet/](https://cryptography.io/en/latest/fernet/) — Fernet key format
- [googleapis.dev/python/google-auth-oauthlib/latest](https://googleapis.dev/python/google-auth-oauthlib/latest/reference/google_auth_oauthlib.flow.html) — Flow class API
- PyPI `pip download` — confirmed versions: cryptography-48.0.0, google_auth_oauthlib-1.3.1, google_api_python_client-2.196.0, msal-1.36.0, webdavclient3-3.14.7, cachetools-6.2.6
- slopcheck 0.6.1 — all 7 packages rated [OK]
### Secondary (MEDIUM confidence)
- [learn.microsoft.com/en-us/entra/msal/python/](https://learn.microsoft.com/en-us/entra/msal/python/) — MSAL Python overview and authorization code flow
- [cachetools.readthedocs.io](https://cachetools.readthedocs.io/en/stable/) — TTLCache thread safety requirement
- [cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html](https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html) — DNS resolution-based SSRF check
### Tertiary (LOW confidence / ASSUMED)
- webdavclient3 specific method names (`upload_to`, `download_from`) — marked [ASSUMED] above; verify during implementation
- Exact Microsoft Graph error field for `invalid_grant` in MSAL — marked [ASSUMED] above
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all packages verified on PyPI, slopcheck clean, versions confirmed
- Architecture: HIGH — built directly from codebase inspection; ABC, factory, CloudConnection model, Redis wiring all verified
- OAuth2 flows: MEDIUM/HIGH — google-auth-oauthlib Flow API verified via official docs; MSAL pattern confirmed via Microsoft docs
- Pitfalls: HIGH — based on official library docs and known OAuth edge cases
- SSRF prevention: HIGH — Python stdlib ipaddress module; OWASP-cited approach
**Research date:** 2026-05-28
**Valid until:** 2026-06-28 (30 days) — package versions are stable but verify before pinning in requirements.txt