curo1305 d13801538d fix(05): revise Phase 5 plans based on checker feedback — B1-B4, W1-W4
B1: Mark RESEARCH.md Open Questions as (RESOLVED) with decision text for all 3
B2: Backends now stateless — raise CloudConnectionError(reason=) only; API layer
    in cloud.py owns token refresh + DB update via _call_cloud_op helper
B3: Add Task 3 to Plan 05 — cloud connection + object cleanup on account deletion (SEC-09)
B4: Add frontend_url setting to Plan 01 Task 1; Plan 05 uses settings.frontend_url
    for OAuth callback redirects
W1: ROADMAP.md Phase 5 now correctly labels Plans 03+04 as Wave 3 (not Wave 2)
W2: Plan 06 invalid_grant test now asserts both 503 HTTP response AND DB REQUIRES_REAUTH
W3: Plan 06 Task 2 split into unit tests (4, cloud_utils.py) and integration tests (11, HTTP)
W4: Plan 07 adds Vitest tests for cloudConnections store (4 tests) and SettingsCloudTab
    mount test (2 tests) per CLAUDE.md testing protocol

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 19:55:28 +02:00

# Phase 5: Cloud Storage Backends — Research
**Researched:** 2026-05-28
**Domain:** OAuth2 cloud provider integration, WebDAV/Nextcloud, credential encryption, SSRF prevention, StorageBackend ABC extension
**Confidence:** HIGH (all package versions verified on PyPI; patterns verified against official docs and codebase)
---
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
- **D-01:** All 4 providers (OneDrive/Microsoft Graph, Google Drive v3, Nextcloud, WebDAV) delivered in this single phase.
- **D-02:** Each provider is a concrete `StorageBackend` subclass in `backend/storage/` (e.g., `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`).
- **D-03:** FastAPI owns the OAuth callback. Flow: user clicks "Connect" → provider OAuth consent page → `GET /api/cloud/oauth/callback/{provider}?code=…&state=…` → FastAPI exchanges code, encrypts credentials, saves to `cloud_connections`, then redirects browser to Vue settings page with `?cloud_connected=google_drive` (or `?cloud_error=…`). Auth code and tokens never land in the frontend.
- **D-04:** OAuth state parameter encodes the authenticated user's ID (signed or encrypted) using `secrets.token_urlsafe(32)` + a short-lived server-side state store (Redis or DB) to validate the callback matches the initiating user session.
- **D-05:** Access token refresh is on-demand and transparent. When a cloud API call fails with token-expiry (HTTP 401), the backend catches it, uses the stored refresh token, updates `credentials_enc` in DB, and retries the original call within the same request.
- **D-06:** If the refresh token is rejected by the provider (`invalid_grant`), the connection status transitions to `REQUIRES_REAUTH` and the request returns an error telling the user to reconnect. No silent failure.
- **D-07:** UI presents both auth methods for Nextcloud/WebDAV (real account password and app-specific password) with clear recommendation for app password.
- **D-08:** On save, backend validates the WebDAV/Nextcloud connection (lightweight PROPFIND or OPTIONS request) before storing credentials. If validation fails, return an error — never store unverified credentials.
- **D-09:** Sidebar shows local MinIO folders first, then each connected cloud provider as a peer top-level node. Lazy-load one level at a time.
- **D-10:** Upload destination follows the active folder context. Cloud uploads go through FastAPI intermediary — no direct browser-to-cloud.
- **D-11:** Existing MinIO documents stay in MinIO — no migration. `storage_backend="minio"` for existing docs; `"google_drive"`, `"onedrive"`, etc. for new cloud docs.
- **D-12:** Cloud provider management lives in a new "Cloud Storage" tab in SettingsView.
- **D-13:** Multiple cloud providers can be connected simultaneously (one row per provider in `cloud_connections`).
- **D-14:** Cloud backends: `generate_presigned_put_url` raises `NotImplementedError`. Upload endpoint detects cloud backends and uses direct upload path.
- **D-15:** Downloads/previews use the same `GET /api/documents/{id}/content` proxy endpoint regardless of backend. Calls `storage_backend.get_object(document.object_key)` and streams bytes to browser.
- **D-16:** Cloud folder tree browsing is live API calls with a 60-second in-memory TTL cache (keyed by `user_id + provider + folder_path`). Not Redis — in-memory is sufficient.
- **D-17:** All outbound HTTP to WebDAV/Nextcloud validates URL against SSRF blocklist (localhost, 127.x, 169.254.x, RFC 1918, ::1). Validation in a shared `validate_cloud_url()` utility called before every request.
- **D-18:** `credentials_enc` encrypted with `HKDF(CLOUD_CREDS_KEY, salt=user_id_bytes, info=b"cloud-credentials")`. Master key in `CLOUD_CREDS_KEY` env var. Never stored unencrypted. Never returned in any API response.
- **D-19:** Admin API responses for cloud connections return only `provider, display_name, connected_at, status` (CloudConnectionOut Pydantic whitelist pattern from Phase 4).
### Claude's Discretion
- Choice of Python OAuth client library for Google Drive and OneDrive (e.g., `google-auth-oauthlib`, `msal`).
- Choice of WebDAV Python library (e.g., `webdavclient3`, `aiohttp` with manual PROPFIND).
- Exact TTL cache implementation (dict + timestamp vs. `cachetools.TTLCache`).
- OAuth state store implementation (Redis vs. short-lived DB row vs. signed JWT).
### Deferred Ideas (OUT OF SCOPE)
- Document migration between backends (user-initiated move of MinIO docs to cloud).
- Cloud-native resumable upload URLs (provider-specific presigned upload sessions).
- Shared cloud storage (team/organization).
- Cloud folder sync / offline cache.
- Email notifications on REQUIRES_REAUTH.
</user_constraints>
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| CLOUD-01 | User can connect OneDrive (Microsoft Graph), Google Drive (v3 API), Nextcloud, or generic WebDAV as a personal storage backend | MSAL + google-auth-oauthlib OAuth2 flows; webdavclient3 for WebDAV/Nextcloud |
| CLOUD-02 | Cloud OAuth credentials encrypted using HKDF per-user key derivation (`HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`); master key in `CLOUD_CREDS_KEY` env var | `cryptography` library HKDF + Fernet pattern documented |
| CLOUD-03 | Local MinIO storage and connected cloud backends coexist; user can select their default storage destination | `documents.storage_backend` column already in schema; `users.default_storage_backend` column already present |
| CLOUD-04 | Each cloud connection displays status: `ACTIVE \| REQUIRES_REAUTH \| ERROR` | `CloudConnection.status` column already in schema |
| CLOUD-05 | On OAuth revocation (`invalid_grant`), connection status transitions to `REQUIRES_REAUTH` — surfaced to user, not retried silently | On-demand token refresh pattern with `invalid_grant` catch documented |
| CLOUD-06 | User can disconnect a cloud backend; credentials are permanently deleted from the DB | `DELETE /api/cloud/connections/{id}` with ownership check |
| CLOUD-07 | Storage backend abstracted via `StorageBackend` ABC + factory in `storage/` module (mirrors existing `ai/` provider pattern) | ABC already exists with 7 abstract methods; factory already in `storage/__init__.py` |
</phase_requirements>
---
## Summary
Phase 5 extends DocuVault's existing storage abstraction with four cloud provider backends. The infrastructure is largely pre-built: the `StorageBackend` ABC with 7 abstract methods already exists (`backend/storage/base.py`), the `cloud_connections` table with all required columns (`id`, `user_id`, `provider`, `credentials_enc`, `status`, `connected_at`) was created in migration 0001, the `documents.storage_backend` column already exists, and `users.default_storage_backend` already exists. No new Alembic migration is needed for the data model.
The three main implementation challenges are: (1) the OAuth2 callback flow where FastAPI owns both the initiation and code-exchange, (2) per-user HKDF credential encryption using the `cryptography` library (which is **not currently in `requirements.txt`** and must be added), and (3) SSRF prevention for user-supplied WebDAV/Nextcloud URLs using Python's built-in `ipaddress` module. Redis is already wired on `app.state.redis` and is the correct choice for OAuth state storage (TTL-backed, eliminates race conditions in multi-instance deployments, already proven pattern in auth.py for TOTP replay prevention).
The WebDAV/Nextcloud backends should use `webdavclient3` wrapped in `asyncio.to_thread()` (matching the MinIOBackend pattern) rather than an async-native library — `webdavclient3` is the most mature option (8+ years old, actively maintained) and its sync API is well-documented. Google Drive uses `google-api-python-client` + `google-auth-oauthlib`; OneDrive uses `msal` with the authorization code flow. Both sync SDKs wrap in `asyncio.to_thread()`.
**Primary recommendation:** Add `cryptography>=41.0.0`, `google-auth-oauthlib>=1.3.1`, `google-api-python-client>=2.196.0`, `msal>=1.36.0`, `webdavclient3>=3.14.7`, and `cachetools>=5.3.0` to `requirements.txt`. Implement OAuth state via Redis TTL (30-minute expiry). Use `cachetools.TTLCache` (version 6.2.6 verified on PyPI) for the 60-second folder listing cache. Use Python's built-in `ipaddress` module for SSRF URL validation; no additional library is needed.
---
## Architectural Responsibility Map
| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| OAuth2 initiation (redirect URL generation) | API / Backend | — | Secrets (client_id, client_secret) must never reach the browser |
| OAuth2 callback code exchange | API / Backend | — | Auth code + client_secret exchange is a server-to-server operation (D-03) |
| OAuth state CSRF validation | API / Backend (Redis) | — | State token must be stored server-side and expire after use (D-04) |
| Credential encryption/decryption | API / Backend | — | HKDF master key lives in env var; decryption happens at API layer only |
| Cloud file upload | API / Backend | Cloud Provider API | Bytes pass through FastAPI intermediary — no direct browser-to-cloud (D-10) |
| Cloud file download/preview | API / Backend | Cloud Provider API | Same proxy endpoint as MinIO (D-15) |
| Cloud folder tree listing | API / Backend | Cloud Provider API | Lazy-load, TTL-cached in FastAPI app state (D-16) |
| SSRF validation | API / Backend | — | Must run before every outbound HTTP call; not frontend-accessible (D-17) |
| Connection status display | Frontend / Client | — | UI reads `status` field from API; no direct cloud calls from browser |
| Cloud Storage settings tab | Frontend / Client | — | New tab in SettingsView; reads/writes via `/api/cloud/connections` |
| On-demand token refresh | API / Backend | — | Transparent to user; handled within the request lifecycle (D-05) |
| Default storage backend selection | API / Backend + DB | Frontend / Client | `users.default_storage_backend` column; UI reads/writes via settings endpoint |
---
## Standard Stack
### Core (new additions to requirements.txt)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `cryptography` | 48.0.0 | HKDF key derivation + Fernet encryption for `credentials_enc` | The only Python library with official HKDF + Fernet in one package; already referenced in CLAUDE.md |
| `google-auth-oauthlib` | 1.3.1 | Google OAuth2 authorization code flow; `Flow` class manages URL generation and code exchange | Official Google library; listed in Google's own Python quickstart |
| `google-api-python-client` | 2.196.0 | Google Drive v3 API (files.get, files.create, files.delete, files.list) | Official Google library; required alongside google-auth-oauthlib for Drive operations |
| `msal` | 1.36.0 | Microsoft Authentication Library — authorization code flow for OneDrive/Microsoft Graph | Official Microsoft library; only sanctioned way to obtain Microsoft Graph tokens |
| `webdavclient3` | 3.14.7 | WebDAV operations (PROPFIND, upload, download, delete) for both Nextcloud and generic WebDAV | Mature (8 years), actively maintained, supports Nextcloud and all standard WebDAV servers |
| `cachetools` | 6.2.6 | `TTLCache` for 60-second folder listing cache in FastAPI app state (D-16) | Standard cache library; pure Python; no new infrastructure dependency |
[VERIFIED: PyPI]: all versions confirmed via `pip download` against the PyPI registry.
### Already in requirements.txt (relevant to Phase 5)
| Library | Current Version Spec | Phase 5 Use |
|---------|---------------------|-------------|
| `httpx` | >=0.27 | Microsoft Graph REST calls (aiohttp alternative); already used for HIBP |
| `redis` | >=4.6.0 | OAuth state storage (TTL-keyed state tokens, already on `app.state.redis`) |
| `aioredis` | via `redis[asyncio]` | Already wired in `main.py` lifespan |
| `pydantic` | >=2.0 | Request/response models for new cloud endpoints |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `webdavclient3` | `aiohttp` + raw PROPFIND XML | webdavclient3 handles XML parsing, redirect following, and auth headers; raw aiohttp requires implementing RFC 4918 manually |
| `webdavclient3` | `aiodav` / `aiowebdav2` | These async WebDAV libs are very new (< 2 years old, low download counts); webdavclient3 wrapped in `asyncio.to_thread()` matches the MinIOBackend pattern and is safer |
| `msal` (for OneDrive) | `requests-oauthlib` + raw Graph calls | MSAL handles token refresh, token cache, and `invalid_grant` detection natively |
| `cachetools.TTLCache` | `dict` + timestamp | TTLCache has automatic expiry and LRU eviction; manual dict+timestamp requires cleanup logic; both work, TTLCache is cleaner |
| Redis for OAuth state | Signed JWT state | Redis is already wired; TTL-keyed Redis entries are the proven pattern (auth.py TOTP replay prevention). Signed JWT state is viable but requires HMAC secret management for state-only tokens |
**Installation:**
```bash
# Add to backend/requirements.txt
cryptography>=41.0.0
google-auth-oauthlib>=1.3.1
google-api-python-client>=2.196.0
msal>=1.36.0
webdavclient3>=3.14.7
cachetools>=5.3.0
```
**Version verification:** Confirmed against PyPI via `pip download`:
- `cryptography-48.0.0` `[VERIFIED: PyPI]`
- `google_auth_oauthlib-1.3.1` `[VERIFIED: PyPI]`
- `google_api_python_client-2.196.0` `[VERIFIED: PyPI]`
- `msal-1.36.0` `[VERIFIED: PyPI]`
- `webdavclient3-3.14.7` `[VERIFIED: PyPI]`
- `cachetools-6.2.6` `[VERIFIED: PyPI]`
---
## Package Legitimacy Audit
All packages verified via slopcheck 0.6.1 (run 2026-05-28):
| Package | Registry | Age | Downloads | Source Repo | slopcheck | Disposition |
|---------|----------|-----|-----------|-------------|-----------|-------------|
| `cryptography` | PyPI | 12+ yrs | 100M+/wk | github.com/pyca/cryptography | [OK] | Approved |
| `google-auth-oauthlib` | PyPI | 7+ yrs | 50M+/wk | github.com/googleapis/google-auth-library-python-oauthlib | [OK] | Approved |
| `google-api-python-client` | PyPI | 10+ yrs | 30M+/wk | github.com/googleapis/google-api-python-client | [OK] — note: "Name ends with '-client' — looks like LLM bait but package is established" | Approved |
| `msal` | PyPI | 6+ yrs | 10M+/wk | github.com/AzureAD/microsoft-authentication-library-for-python | [OK] | Approved |
| `webdavclient3` | PyPI | 8+ yrs | 200K+/wk | github.com/CloudPolis/webdavclient3 | [OK] | Approved |
| `cachetools` | PyPI | 10+ yrs | 80M+/wk | github.com/tkem/cachetools | [OK] | Approved |
**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none
---
## Architecture Patterns
### System Architecture Diagram
```
Browser (Vue 3)
│ Click "Connect Google Drive"
[GET /api/cloud/oauth/initiate/google_drive]
│ 1. Generate state_token = secrets.token_urlsafe(32)
│ 2. Store Redis: oauth_state:{state_token} = user_id (TTL 30 min)
│ 3. Build authorization_url via google_auth_oauthlib.Flow
│ 4. HTTP 302 redirect → Google OAuth consent page
Google OAuth Consent Page (browser)
│ User approves
│ Google redirects to:
[GET /api/cloud/oauth/callback/google_drive?code=...&state=...]
│ 1. Validate state → lookup Redis oauth_state:{state} → get user_id
│ 2. Delete Redis key (prevent replay)
│ 3. Exchange code → tokens via flow.fetch_token()
│ 4. Serialize credentials (access_token, refresh_token, expiry)
│ 5. Encrypt with HKDF-derived per-user Fernet key
│ 6. Save/upsert cloud_connections row (user_id, provider, credentials_enc, status=ACTIVE)
│ 7. HTTP 302 redirect → Vue /settings?cloud_connected=google_drive
Vue SettingsView (onMounted)
│ Reads ?cloud_connected=google_drive
│ Shows success toast
[GET /api/cloud/connections]
│ Lists all cloud connections for current user
│ Returns CloudConnectionOut (no credentials_enc)
Browser renders Cloud Storage tab with connection status badges
─────── Document Upload to Cloud Folder ───────
Browser (Vue 3)
│ User is viewing Google Drive folder node
│ Drops file
[POST /api/documents/upload]
│ active folder context = cloud folder (provider=google_drive, folder_id=...)
│ 1. Load CloudConnection for user + provider
│ 2. Decrypt credentials_enc → Fernet key → credentials dict
│ 3. Check token expiry → if expired, refresh transparently (D-05)
│ 4. Call google_drive_backend.put_object(user_id, doc_id, bytes, ext, ct)
│ └── asyncio.to_thread → drive.files().create(...)
│ 5. Save Document(storage_backend="google_drive", object_key=drive_file_id)
Browser shows upload progress (same UploadProgress component)
─────── Document Download from Cloud ───────
[GET /api/documents/{id}/content]
│ 1. Load Document → storage_backend = "google_drive"
│ 2. get_storage_backend("google_drive", user_id, session) → GoogleDriveBackend
│ 3. backend.get_object(object_key) → bytes
│ 4. StreamingResponse to browser
Browser renders PDF in existing DocumentPreviewModal
─────── WebDAV/Nextcloud Connection ───────
Browser
│ User submits server_url + username + password (or app password)
[POST /api/cloud/connections/webdav]
│ 1. validate_cloud_url(server_url) → SSRF check (ipaddress module)
│ 2. Test connection: PROPFIND server_url (lightweight)
│ 3. If success: encrypt credentials → save cloud_connections
│ 4. If fail: 422 with error message (D-08)
Browser shows ACTIVE status badge
```
### Recommended Project Structure
```
backend/storage/
├── base.py # existing StorageBackend ABC (7 abstract methods)
├── __init__.py # extend get_storage_backend() factory
├── minio_backend.py # existing reference implementation
├── google_drive_backend.py # new: Google Drive v3
├── onedrive_backend.py # new: Microsoft Graph / OneDrive
├── nextcloud_backend.py # new: Nextcloud (WebDAV + status endpoint)
├── webdav_backend.py # new: generic WebDAV
└── cloud_utils.py # new: validate_cloud_url(), encrypt_credentials(), decrypt_credentials()
backend/api/
└── cloud.py # new: all /api/cloud/* endpoints
backend/services/
└── cloud_cache.py # new: TTLCache singleton for folder listings
backend/tests/
└── test_cloud.py # new: all Phase 5 tests
```
### Pattern 1: StorageBackend ABC Contract (7 methods)
The existing ABC requires all 7 methods. Cloud backends raise `NotImplementedError` for `generate_presigned_put_url` per D-14:
```python
# Source: backend/storage/base.py (verified in codebase)
class StorageBackend(ABC):
    @abstractmethod
    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str: ...

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def delete_object(self, object_key: str) -> None: ...

    @abstractmethod
    async def presigned_get_url(self, object_key: str, expires_minutes: int = 60) -> str: ...

    @abstractmethod
    async def health_check(self) -> bool: ...

    @abstractmethod
    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str: ...

    @abstractmethod
    async def stat_object(self, object_key: str) -> int: ...
```
Cloud backends define all 7 methods. `generate_presigned_put_url` raises `NotImplementedError` because the upload endpoint detects cloud backends and uses the direct path (D-14). `presigned_get_url` raises it as well, since downloads and previews always go through the proxy endpoint (D-15). `stat_object` returns the file size from the provider's metadata response.
The `object_key` for cloud backends is the **provider's native file ID** (e.g., Google Drive file ID, OneDrive item ID, WebDAV path). The STORE-02 key schema (`{user_id}/{document_id}/{uuid4()}{ext}`) applies only to MinIO.
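A minimal sketch of how a cloud backend could satisfy this contract while opting out of presigned URLs. `StorageBackendSketch` (a trimmed copy of the ABC with only four of the seven methods) and `InMemoryCloudBackend` are hypothetical stand-ins for illustration, not code from the repository; a real backend would call the provider SDK where this one uses a dict.

```python
import asyncio
import uuid
from abc import ABC, abstractmethod


class StorageBackendSketch(ABC):
    """Trimmed stand-in for the real ABC (4 of the 7 methods)."""

    @abstractmethod
    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str: ...

    @abstractmethod
    async def get_object(self, object_key: str) -> bytes: ...

    @abstractmethod
    async def stat_object(self, object_key: str) -> int: ...

    @abstractmethod
    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str: ...


class InMemoryCloudBackend(StorageBackendSketch):
    """Hypothetical cloud backend: a dict plays the role of the provider."""

    def __init__(self) -> None:
        self._files = {}

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        # The provider chooses the file ID; that ID becomes object_key.
        file_id = uuid.uuid4().hex
        self._files[file_id] = file_bytes
        return file_id

    async def get_object(self, object_key: str) -> bytes:
        return self._files[object_key]

    async def stat_object(self, object_key: str) -> int:
        # A real backend would read the size from provider metadata.
        return len(self._files[object_key])

    async def generate_presigned_put_url(self, object_key: str, expires_minutes: int = 15) -> str:
        raise NotImplementedError("cloud backends upload through the API proxy (D-14)")


async def demo():
    backend = InMemoryCloudBackend()
    key = await backend.put_object("u1", "d1", b"hello", ".txt", "text/plain")
    return await backend.get_object(key), await backend.stat_object(key)
```

The returned key is opaque to the caller, which is why the MinIO key schema does not need to apply to cloud documents.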
### Pattern 2: HKDF + Fernet Credential Encryption
```python
# Source: cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/
# [VERIFIED: CITED: cryptography.io]
import base64
import json

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def _derive_fernet_key(master_key: bytes, user_id: str) -> Fernet:
    """Derive a per-user Fernet key using HKDF-SHA256.

    master_key = CLOUD_CREDS_KEY env var as bytes
    salt = user_id bytes (deterministic per user; the same key is needed on decrypt)
    info = b"cloud-credentials" (domain separation)
    """
    hkdf = HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=user_id.encode("utf-8"),  # deterministic salt = user_id
        info=b"cloud-credentials",
    )
    raw_key = hkdf.derive(master_key)
    fernet_key = base64.urlsafe_b64encode(raw_key)
    return Fernet(fernet_key)


def encrypt_credentials(master_key: bytes, user_id: str, credentials: dict) -> str:
    """Encrypt credentials dict to base64 Fernet token string."""
    f = _derive_fernet_key(master_key, user_id)
    plaintext = json.dumps(credentials).encode("utf-8")
    return f.encrypt(plaintext).decode("utf-8")


def decrypt_credentials(master_key: bytes, user_id: str, credentials_enc: str) -> dict:
    """Decrypt credentials_enc back to dict."""
    f = _derive_fernet_key(master_key, user_id)
    plaintext = f.decrypt(credentials_enc.encode("utf-8"))
    return json.loads(plaintext)
```
**Critical note:** HKDF is **not** reusable — a new `HKDF` instance must be created for each derivation call. The `cryptography` library raises `AlreadyFinalized` if `.derive()` is called twice on the same instance. The `_derive_fernet_key` function must create a fresh `HKDF` instance each call.
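To make the per-user key separation concrete, the sketch below re-implements HKDF-SHA256 (RFC 5869) with only the standard library and derives keys for two user IDs from one master key. It is illustrative only: `hkdf_sha256` is a hypothetical helper and the master key is a placeholder. Production code should keep using the `cryptography` `HKDF` class from the pattern above.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF extract-then-expand (RFC 5869) using HMAC-SHA256."""
    if not salt:
        # RFC 5869: an absent salt is treated as HashLen zero bytes.
        salt = b"\x00" * hashlib.sha256().digest_size
    # Extract: PRK = HMAC(salt, input key material)
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


master = b"example-master-key-not-a-real-secret"  # placeholder, not a real key
key_a = hkdf_sha256(master, b"user-a", b"cloud-credentials")
key_b = hkdf_sha256(master, b"user-b", b"cloud-credentials")
key_a_again = hkdf_sha256(master, b"user-a", b"cloud-credentials")
```

Because the salt is the user ID, `key_a` and `key_a_again` match while `key_b` differs, so a ciphertext copied between users' rows fails to decrypt.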
### Pattern 3: Google Drive OAuth2 Flow via google-auth-oauthlib
```python
# Source: googleapis.dev/python/google-auth-oauthlib/latest (VERIFIED: official docs)
from google_auth_oauthlib.flow import Flow

# Shared by initiation and callback so both build an identical Flow
client_config = {
    "web": {
        "client_id": settings.google_client_id,
        "client_secret": settings.google_client_secret,
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
    }
}
scopes = ["https://www.googleapis.com/auth/drive.file"]
redirect_uri = f"{settings.backend_url}/api/cloud/oauth/callback/google_drive"

# At initiation:
flow = Flow.from_client_config(client_config, scopes=scopes)
flow.redirect_uri = redirect_uri
authorization_url, state = flow.authorization_url(access_type="offline", prompt="consent")
# Store state → Redis (key: oauth_state:{state}, value: user_id, TTL 30 min)
# Redirect browser to authorization_url

# At callback:
# Restore flow from client config (stateless: recreate Flow on each callback)
flow = Flow.from_client_config(client_config, scopes=scopes, state=state)
flow.redirect_uri = redirect_uri
flow.fetch_token(code=code)
creds = flow.credentials
# creds.token = access token
# creds.refresh_token = refresh token
# creds.expiry = datetime
```
**`access_type="offline"` is required** to obtain a refresh token. Without it, Google only returns a short-lived access token. `prompt="consent"` forces re-consent on each connect, which ensures a fresh refresh token.
### Pattern 4: OneDrive OAuth2 Flow via MSAL
```python
# Source: learn.microsoft.com/en-us/entra/msal/python/ [CITED]
import msal

# MSAL Python adds the reserved scopes (openid, profile, offline_access) itself
# and raises ValueError if any of them is passed explicitly.
SCOPES = ["Files.ReadWrite"]
REDIRECT_URI = f"{settings.backend_url}/api/cloud/oauth/callback/onedrive"

# Confidential client app (has client_secret)
app = msal.ConfidentialClientApplication(
    client_id=settings.onedrive_client_id,
    client_credential=settings.onedrive_client_secret,
    authority=f"https://login.microsoftonline.com/{settings.onedrive_tenant_id}",
)

# At initiation:
auth_url = app.get_authorization_request_url(
    scopes=SCOPES,
    redirect_uri=REDIRECT_URI,
    state=state_token,
)
# Redirect browser to auth_url

# At callback:
result = app.acquire_token_by_authorization_code(
    code=code,
    scopes=SCOPES,
    redirect_uri=REDIRECT_URI,
)
# result["access_token"]: short-lived access token
# result["refresh_token"]: long-lived refresh token
# result["expires_in"]: seconds until access_token expires

# Refresh on-demand (D-05):
result = app.acquire_token_by_refresh_token(
    refresh_token=stored_refresh_token,
    scopes=SCOPES,
)
# If result.get("error") == "invalid_grant" → REQUIRES_REAUTH (D-06)
```
**A refresh token requires the `offline_access` scope** on the Microsoft identity platform. MSAL Python requests it automatically and rejects reserved scopes (`openid`, `profile`, `offline_access`) passed in `scopes`, so only `Files.ReadWrite` is listed. The `tenant_id` can be `"common"` for multi-tenant apps (personal OneDrive and organizational accounts). For personal OneDrive only, use `"consumers"`.
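MSAL returns a plain dict for both success and failure, so the D-05/D-06 branch can be reduced to a small classifier. The sketch below is an assumption about how the API layer might interpret that dict. `classify_token_result` is a hypothetical name, and it relies only on the `access_token` and `error` keys referenced in the pattern above.

```python
def classify_token_result(result: dict) -> str:
    """Map an MSAL-style token response dict to a connection status."""
    if "access_token" in result:
        return "ACTIVE"
    if result.get("error") == "invalid_grant":
        # Refresh token revoked or expired: the user must reconnect (D-06).
        return "REQUIRES_REAUTH"
    # Anything else (network failure, throttling, misconfiguration).
    return "ERROR"
```

Keeping this mapping in one function gives the `REQUIRES_REAUTH` transition a single place to be unit-tested without calling Microsoft.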
### Pattern 5: WebDAV Operations via webdavclient3 + asyncio.to_thread
```python
# Source: pypi.org/project/webdavclient3 (VERIFIED: PyPI) [ASSUMED: specific API usage]
import asyncio
import io

from webdav3.client import Client


class WebDAVBackend(StorageBackend):
    # Remaining abstract methods omitted for brevity
    def __init__(self, server_url: str, username: str, password: str):
        options = {
            "webdav_hostname": server_url,
            "webdav_login": username,
            "webdav_password": password,
        }
        self._client = Client(options)
        self._base_path = "docuvault/"  # namespace prefix in WebDAV tree

    async def put_object(self, user_id, document_id, file_bytes, extension, content_type) -> str:
        # object_key = WebDAV path used as identifier
        object_key = f"{self._base_path}{user_id}/{document_id}{extension}"
        buf = io.BytesIO(file_bytes)
        await asyncio.to_thread(self._client.upload_to, buf, object_key)
        return object_key

    async def get_object(self, object_key: str) -> bytes:
        buf = io.BytesIO()
        await asyncio.to_thread(self._client.download_from, buf, object_key)
        return buf.getvalue()
```
Note: `webdavclient3` is synchronous. All calls MUST be wrapped in `asyncio.to_thread()` — same pattern as `MinIOBackend`. [ASSUMED: `upload_to`/`download_from` method names — verify against installed package docs]
### Pattern 6: SSRF Prevention via ipaddress Module
```python
# Source: python.org/library/ipaddress [VERIFIED: Python stdlib]
import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_NETS = [
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("169.254.0.0/16"),  # link-local
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("::1/128"),         # IPv6 loopback
    ipaddress.ip_network("fc00::/7"),        # IPv6 ULA
]


def validate_cloud_url(url: str) -> None:
    """Raise ValueError if url targets a private/internal address.

    Called at connect-time and before every WebDAV/Nextcloud request.
    D-17: blocks localhost, 127.x, 169.254.x, RFC 1918 ranges, ::1.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme}")
    hostname = parsed.hostname
    if not hostname:
        raise ValueError("URL has no hostname")
    try:
        # Hostname is already a literal IP address
        addrs = [ipaddress.ip_address(hostname)]
    except ValueError:
        # Not a raw IP: resolve via DNS and check every returned address,
        # since one name can map to both public and internal IPs
        try:
            infos = socket.getaddrinfo(hostname, None)
            addrs = [ipaddress.ip_address(info[4][0]) for info in infos]
        except (socket.gaierror, ValueError) as exc:
            raise ValueError(f"Cannot resolve hostname: {exc}") from exc
    for addr in addrs:
        for net in BLOCKED_NETS:
            if addr in net:
                raise ValueError(f"URL targets a private/internal address: {addr}")
```
**Security note:** DNS-based SSRF bypass is a known attack vector — an attacker registers a DNS name that resolves to an internal IP. The `validate_cloud_url` function must resolve DNS and check the resolved IP, not just the hostname string. This pattern is the OWASP-recommended approach. [CITED: cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html]
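One possible hardening of the blocklist check, offered as a sketch rather than a requirement from D-17: an IPv4-mapped IPv6 literal such as `::ffff:10.0.0.5` is an `IPv6Address` and therefore never matches the IPv4 networks in the list. The hypothetical helper `is_blocked_address` below unwraps the embedded IPv4 address before testing membership. It handles literal IPs only and performs no DNS resolution.

```python
import ipaddress

BLOCKED_NETS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "169.254.0.0/16", "10.0.0.0/8",
    "172.16.0.0/12", "192.168.0.0/16", "::1/128", "fc00::/7",
)]


def is_blocked_address(raw: str) -> bool:
    """Return True if a literal IP address falls inside the blocklist."""
    addr = ipaddress.ip_address(raw)
    # An IPv4-mapped IPv6 address (::ffff:a.b.c.d) embeds an IPv4 address;
    # check the embedded address so the IPv4 ranges still apply.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    # Membership is False when address and network versions differ.
    return any(addr in net for net in BLOCKED_NETS)
```

The same unwrapping step could be applied to each resolved address inside `validate_cloud_url` before the network loop.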
### Pattern 7: OAuth State Storage via Redis
```python
# Source: established pattern from backend/api/auth.py (VERIFIED: codebase)
# Redis is already on app.state.redis (aioredis client)
import secrets
import uuid

# At OAuth initiation:
state_token = secrets.token_urlsafe(32)
redis_key = f"oauth_state:{state_token}"
await request.app.state.redis.setex(
    redis_key,
    1800,  # 30-minute TTL: long enough for user to complete OAuth consent
    str(current_user.id),
)
# Return redirect to authorization_url with state=state_token

# At OAuth callback:
redis_key = f"oauth_state:{state}"
user_id_bytes = await request.app.state.redis.get(redis_key)
if not user_id_bytes:
    raise HTTPException(400, "Invalid or expired OAuth state")
await request.app.state.redis.delete(redis_key)  # single-use
user_id = uuid.UUID(user_id_bytes.decode())  # assumes the client returns bytes
```
This follows the exact same pattern as TOTP replay prevention in `auth.py` — Redis TTL key, single-use deletion after validation.
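For unit tests that should not depend on a running Redis, the issue-then-consume semantics can be mimicked with an in-memory double. This is a sketch for tests only. `InMemoryStateStore` and its method names are hypothetical, and production must keep the state in Redis because process memory is not shared between instances.

```python
import secrets
import time
from typing import Callable, Optional


class InMemoryStateStore:
    """Test double mimicking SETEX plus single-use GET/DELETE semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data = {}  # token -> (expires_at, user_id)

    def issue(self, user_id: str, ttl_seconds: int = 1800) -> str:
        token = secrets.token_urlsafe(32)
        self._data[token] = (self._clock() + ttl_seconds, user_id)
        return token

    def consume(self, token: str) -> Optional[str]:
        # pop() removes the token whether or not it is still valid,
        # so a state value can never be accepted twice.
        entry = self._data.pop(token, None)
        if entry is None:
            return None
        expires_at, user_id = entry
        return user_id if self._clock() < expires_at else None
```

Injecting the clock lets a test advance time past the 30-minute TTL without sleeping.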
### Pattern 8: TTLCache for Folder Listings (cachetools)
```python
# Source: cachetools.readthedocs.io [CITED]
import threading

from cachetools import TTLCache

# In FastAPI lifespan or module-level singleton
# maxsize=1000: enough for ~50 users × 20 folder nodes each
# ttl=60: 60-second cache per D-16
_folder_cache: TTLCache = TTLCache(maxsize=1000, ttl=60)
_folder_cache_lock = threading.Lock()


async def get_cloud_folders_cached(user_id: str, provider: str, folder_id: str, fetch_fn) -> list:
    """Return cached result or call fetch_fn and cache it."""
    cache_key = f"{user_id}:{provider}:{folder_id}"
    with _folder_cache_lock:
        if cache_key in _folder_cache:
            return _folder_cache[cache_key]
    result = await fetch_fn()  # async, so awaited outside the lock
    with _folder_cache_lock:
        _folder_cache[cache_key] = result
    return result
```
**Thread safety:** `cachetools.TTLCache` is not thread-safe by itself. A `threading.Lock` is required for concurrent access. The fetch function itself is async and must be called outside the lock to avoid blocking the event loop. [CITED: cachetools.readthedocs.io — "access to a shared cache from multiple threads must be properly synchronized"]
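For comparison, the dict + timestamp alternative listed under Claude's Discretion can be sketched with the standard library alone. `SimpleTTLCache` is a hypothetical name. Unlike `TTLCache` it has no `maxsize` bound and evicts entries only when they are read, which is the cleanup cost the comparison table refers to.

```python
import time
from typing import Any, Callable, Optional


class SimpleTTLCache:
    """Dict + timestamp cache; entries expire `ttl` seconds after being set."""

    def __init__(self, ttl: float = 60.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            # Expired: evict lazily on access.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
```

The cache key would follow the same `user_id:provider:folder_id` shape used in the pattern above.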
### Pattern 9: Factory Extension (get_storage_backend)
```python
# Source: backend/storage/__init__.py (VERIFIED: codebase)
# Current factory only returns MinIOBackend. Phase 5 extends it:
async def get_storage_backend_for_document(
    document: Document,
    user: User,
    session: AsyncSession,
) -> StorageBackend:
    """Return the correct StorageBackend for the given document.

    MinIO documents (storage_backend='minio'): return shared MinIOBackend.
    Cloud documents: load CloudConnection, decrypt credentials, return backend instance.
    """
    if document.storage_backend == "minio":
        return get_storage_backend()  # existing factory

    # Load cloud connection
    result = await session.execute(
        select(CloudConnection).where(
            CloudConnection.user_id == user.id,
            CloudConnection.provider == document.storage_backend,
            CloudConnection.status == "ACTIVE",
        )
    )
    conn = result.scalar_one_or_none()
    if conn is None:
        raise HTTPException(503, "Cloud connection not found or inactive")

    master_key = settings.cloud_creds_key.encode()
    credentials = decrypt_credentials(master_key, str(user.id), conn.credentials_enc)

    if document.storage_backend == "google_drive":
        return GoogleDriveBackend(credentials)
    elif document.storage_backend == "onedrive":
        return OneDriveBackend(credentials)
    elif document.storage_backend in ("nextcloud", "webdav"):
        return WebDAVBackend(credentials["server_url"], credentials["username"], credentials["password"])
    else:
        raise ValueError(f"Unknown storage backend: {document.storage_backend}")
```
### Pattern 10: On-Demand Token Refresh (D-05)
```python
# Source: D-05 decision (CONTEXT.md) [ASSUMED: exact error class names]
class GoogleDriveBackend(StorageBackend):
    async def _call_with_refresh(self, operation_fn, credentials: dict, user_id: str, conn: CloudConnection, session):
        """Attempt operation; on 401, refresh tokens and retry once."""
        try:
            return await operation_fn(credentials)
        except Exception as e:
            # Google Drive: googleapiclient.errors.HttpError with status 401
            if _is_token_expired_error(e):
                new_creds = await self._refresh_token(credentials)
                if new_creds is None:
                    # invalid_grant: set REQUIRES_REAUTH (D-06)
                    conn.status = "REQUIRES_REAUTH"
                    await session.commit()
                    raise CloudConnectionError("Cloud connection requires re-authentication")
                # Update credentials_enc
                master_key = settings.cloud_creds_key.encode()
                conn.credentials_enc = encrypt_credentials(master_key, user_id, new_creds)
                conn.status = "ACTIVE"
                await session.commit()
                return await operation_fn(new_creds)
            raise
```
### Anti-Patterns to Avoid
- **Storing OAuth state in FastAPI process memory:** Multi-instance deployments will fail because the callback may arrive at a different instance than the one that created the state. Use Redis.
- **Reusing the HKDF instance:** The `cryptography` library raises `AlreadyFinalized` on second call to `.derive()`. Always create a new `HKDF` instance per key derivation.
- **Checking hostname string for SSRF, not resolved IP:** `validate_cloud_url("http://internal.corp")` would pass a string check but may resolve to `10.0.0.1`. Always resolve DNS and check the resulting IP.
- **Returning `credentials_enc` in any API response:** The `CloudConnectionOut` Pydantic model (already in `admin.py`) is the whitelist — use it for all cloud connection responses.
- **Calling cloud SDK methods from the async event loop without `asyncio.to_thread()`:** All cloud SDKs (`google-api-python-client`, `msal`, `webdavclient3`) are synchronous. Blocking the event loop kills throughput.
- **Using `prompt="consent"` only on first connect:** Without `prompt="consent"`, Google may not return a refresh token on reconnect if the app was previously authorized. Always pass `prompt="consent"` to guarantee a fresh refresh token.
- **Single cloud_connections row per user:** The schema supports multiple providers simultaneously (one row per provider per user, D-13). The upsert logic must match on `(user_id, provider)` not just `user_id`.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| OAuth2 PKCE + token exchange for Google | Custom HMAC/base64 code verifier | `google_auth_oauthlib.flow.Flow` | Handles RFC 7636 PKCE, redirect URI validation, and token serialization |
| OAuth2 for Microsoft Graph | Raw `requests` calls to login.microsoftonline.com | `msal.ConfidentialClientApplication` | MSAL handles token cache, `invalid_grant` detection, tenant routing, and PKCE |
| WebDAV PROPFIND XML | Raw `httpx` with hand-coded XML bodies | `webdavclient3.Client` | PROPFIND response parsing, multistatus handling, redirect following |
| Fernet encryption + key derivation | AES-GCM + custom key stretching | `cryptography` Fernet + HKDF | Fernet is misuse-resistant (authenticated encryption with IV, HMAC tag) — hand-rolled AES can fail silently |
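| Private IP detection for SSRF | Regex on URL string | `ipaddress.ip_network().supernet_of()` | Python's `ipaddress` module parses IPv4/IPv6 literals and handles range checks; unwrap IPv4-mapped addresses such as `::ffff:127.0.0.1` via `ipv4_mapped` first so the embedded IPv4 address is what gets checked |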
| In-memory TTL cache | `dict` with `asyncio.get_event_loop().time()` comparison | `cachetools.TTLCache` | TTLCache provides bounded size, LRU eviction, and correct TTL semantics; it is not thread-safe on its own, so shared access still needs a lock |
| OAuth state token validation | JWT with custom HMAC | Redis TTL key | Redis TTL provides natural expiry + single-use deletion; no new secret required |
**Key insight:** All cloud credential handling is a solved problem at the library level. The most likely Phase 5 failure mode is re-implementing OAuth token exchange by hand, where edge cases around redirect URI matching, PKCE, and token format break silently.
---
## Common Pitfalls
### Pitfall 1: Google Refresh Token Only Issued Once
**What goes wrong:** User connects Google Drive; the first connection includes a refresh token. Later the user disconnects and reconnects. Google does not issue a new refresh token because the user already authorized the app — the re-authorization returns only an access token. Credentials are stored but the connection goes stale in 1 hour.
**Why it happens:** Google only issues a refresh token on the first authorization for a given client_id + user pair, or when `prompt="consent"` is explicitly passed.
**How to avoid:** Always pass `prompt="consent"` and `access_type="offline"` in `flow.authorization_url()`.
**Warning signs:** `credentials.refresh_token` is `None` after `flow.fetch_token()`.
### Pitfall 2: webdavclient3 Path Encoding for Nextcloud
**What goes wrong:** Nextcloud returns 404 or 207 Multi-Status with an empty propfind result for paths with spaces or non-ASCII characters when the path is not percent-encoded.
**Why it happens:** Nextcloud's WebDAV endpoint requires percent-encoded paths; webdavclient3 may or may not encode paths depending on the method called.
**How to avoid:** Use `urllib.parse.quote()` on all path segments before passing to webdavclient3 operations that accept raw paths. [ASSUMED — verify against webdavclient3 docs during implementation]
**Warning signs:** Works with ASCII-only filenames; fails with spaces or umlauts.
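A minimal sketch of the per-segment encoding described in Pitfall 2, using only the standard library. `encode_dav_path` is a hypothetical helper name. Whether webdavclient3 already encodes paths for a given method is still unverified (assumption A4), so applying this unconditionally could double-encode; treat it as something to test against a real Nextcloud instance.

```python
from urllib.parse import quote


def encode_dav_path(path: str) -> str:
    """Percent-encode each segment of a relative WebDAV path, keeping '/' separators."""
    # safe="" encodes reserved characters inside a segment as well (e.g. '#', '?').
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```

For example, `encode_dav_path("Dokumente/Übersicht 2026.pdf")` yields `Dokumente/%C3%9Cbersicht%202026.pdf`, covering both the space and the umlaut case named in the warning signs.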
### Pitfall 3: HKDF AlreadyFinalized Error
**What goes wrong:** `cryptography.exceptions.AlreadyFinalized` is raised when `HKDF.derive()` is called a second time on the same instance.
**Why it happens:** HKDF is a one-shot operation by design in the `cryptography` library.
**How to avoid:** Create a new `HKDF(...)` instance inside `_derive_fernet_key()` on every call — never store or reuse the HKDF instance.
**Warning signs:** Works in unit tests (each test creates a fresh instance), fails under concurrent load or in repeated calls within the same request.
### Pitfall 4: OAuth Callback State Mismatch in Multi-Instance Deployment
**What goes wrong:** State token is stored in a Python dict in-process. The OAuth callback arrives at a different uvicorn instance → `invalid state` error.
**Why it happens:** HTTP requests are not session-sticky in a load-balanced deployment.
**How to avoid:** Store OAuth state in Redis (`app.state.redis`) with a 30-minute TTL. [VERIFIED: Redis already wired in codebase at `app.state.redis`]
**Warning signs:** OAuth works in single-instance Docker Compose but fails intermittently in production.
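The Redis-backed state handling from Pitfall 4 could be sketched as follows. This is an illustration under assumptions: `redis` is taken to be an async client such as the one at `app.state.redis`, the `oauth_state:` key prefix and both function names are hypothetical, and `GETDEL` requires Redis 6.2 or newer (on older servers a transaction or Lua script would be needed to keep the read-and-delete atomic).

```python
import secrets
from typing import Optional

STATE_TTL_SECONDS = 30 * 60  # 30-minute TTL, matching the value stated above


async def create_oauth_state(redis, user_id: str) -> str:
    """Generate a random state token and store it with a TTL."""
    state = secrets.token_urlsafe(32)
    await redis.set(f"oauth_state:{state}", user_id, ex=STATE_TTL_SECONDS)
    return state


async def consume_oauth_state(redis, state: str) -> Optional[str]:
    """Atomically fetch and delete the state; returns the user id or None."""
    value = await redis.getdel(f"oauth_state:{state}")
    if value is None:
        return None  # unknown, expired, or already used
    # Clients without decode_responses=True return bytes.
    return value.decode() if isinstance(value, bytes) else value
```

Because the token is deleted on first read, a replayed callback with the same `state` gets `None` and can be rejected with a 400, regardless of which instance handles it.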
### Pitfall 5: DNS Rebinding Attack on SSRF Validation
**What goes wrong:** `validate_cloud_url` resolves `attacker.com` to `8.8.8.8` (passes validation), then the subsequent request resolves `attacker.com` to `169.254.169.254` (cloud metadata endpoint). The validation and the actual request see different IPs.
**Why it happens:** DNS TTL expires between validation and request; attacker controls the DNS.
**How to avoid:** Use `socket.create_connection` with the pre-validated IP directly (pin the IP), or document that a network-level egress firewall is the defense-in-depth layer for DNS rebinding. Calling `validate_cloud_url` immediately before each request (not only once at connect time) narrows the window but does not close it. [CITED: cheatsheetseries.owasp.org]
**Warning signs:** SSRF test passes with direct IP inputs but might miss DNS-based attacks.
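The "resolve once, validate, then pin" idea from Pitfall 5 might look like the sketch below. It is not the project's `validate_cloud_url`; `resolve_and_validate` and `_is_forbidden` are hypothetical names, and the `resolver` parameter exists only so the lookup can be replaced in tests. The sketch rejects a hostname if any of its resolved addresses is non-public and returns the first validated address for the caller to connect to.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _is_forbidden(ip) -> bool:
    """True for loopback, private, link-local, and other non-public addresses."""
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def resolve_and_validate(url: str, resolver=socket.getaddrinfo) -> str:
    """Resolve the URL's host once and return a validated IP to pin."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("URL must be http(s) and include a host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = resolver(parts.hostname, port, type=socket.SOCK_STREAM)
    addresses = [ipaddress.ip_address(info[4][0]) for info in infos]
    if not addresses:
        raise ValueError("Host did not resolve to any address")
    for ip in addresses:
        if _is_forbidden(ip):
            raise ValueError(f"Refusing to connect to non-public address {ip}")
    return str(addresses[0])
```

Pinning only helps if the outbound request actually connects to the returned IP while still sending the original hostname for TLS and the `Host` header. How to do that with webdavclient3 is not covered here and would need to be checked during implementation.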
### Pitfall 6: Microsoft Graph Upload Size Limit
**What goes wrong:** Files larger than 4 MB fail with `413 Request Entity Too Large` when uploaded via a single PUT/POST to Microsoft Graph.
**Why it happens:** Microsoft Graph's simple upload endpoint is limited to 4 MB. Larger files require a resumable upload session (`createUploadSession`).
**How to avoid:** For Phase 5, use a resumable upload session for every OneDrive upload regardless of size (see Open Question 3), so no size gate is needed. Use `POST /me/drive/root:/{path}:/createUploadSession` to get an upload URL, then upload in 10 MB chunks.
**Warning signs:** Tests with small files pass; production uploads of real documents (> 4 MB) fail silently or with 413.
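The chunking arithmetic for an upload session can be sketched without any HTTP code. Microsoft Graph requires each fragment except the last to be a multiple of 320 KiB, and each fragment is sent with a `Content-Range: bytes start-end/total` header. `iter_content_ranges` is a hypothetical helper name; the actual PUT requests to the upload URL are left out.

```python
from typing import Iterator, Tuple

GRAPH_FRAGMENT_MULTIPLE = 320 * 1024   # 327,680 bytes
CHUNK_SIZE = 10 * 1024 * 1024          # 10 MiB = 32 * 320 KiB


def iter_content_ranges(
    total_size: int, chunk_size: int = CHUNK_SIZE
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end_inclusive, Content-Range header value) per fragment."""
    if total_size < 0:
        raise ValueError("total_size must not be negative")
    if chunk_size <= 0 or chunk_size % GRAPH_FRAGMENT_MULTIPLE != 0:
        raise ValueError("chunk_size must be a positive multiple of 320 KiB")
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1  # inclusive end offset
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1
```

A 25 MiB file yields three fragments (10 MiB, 10 MiB, 5 MiB). A zero-byte file yields no fragments, so empty uploads would need separate handling.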
### Pitfall 7: Google Drive files() Service is Synchronous
**What goes wrong:** `googleapiclient.discovery.build()` and all `service.files().xxx().execute()` calls are synchronous and block the event loop.
**Why it happens:** `google-api-python-client` was built before asyncio was standard.
**How to avoid:** Wrap every SDK call in `asyncio.to_thread()`. Do NOT await `service.files().list()` directly — it is not a coroutine.
**Warning signs:** FastAPI request handler completes quickly in tests but blocks under load.
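The wrapping described in Pitfall 7 can be illustrated with a stand-in for the blocking SDK call. `blocking_list_files` below is a hypothetical placeholder that simulates a synchronous request. With the real SDK the wrapped callable would be the request's `execute` method, for example `await asyncio.to_thread(request.execute)`.

```python
import asyncio
import time


def blocking_list_files(page_size: int) -> list:
    """Stand-in for a synchronous SDK call such as service.files().list(...).execute()."""
    time.sleep(0.05)  # simulates blocking network I/O
    return [f"file-{i}" for i in range(page_size)]


async def list_files(page_size: int) -> list:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_list_files, page_size)


async def main() -> tuple:
    # Both calls overlap instead of running back to back on the event loop.
    return await asyncio.gather(list_files(2), list_files(3))
```

One design point to check during implementation: since calls now run on worker threads, any SDK client object shared between concurrent requests must be safe for that, or a client should be built per call.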
---
## Code Examples
### Credential Round-Trip Test (CLOUD-02)
```python
# Source: based on cryptography.io HKDF docs [CITED: cryptography.io]
# encrypt_credentials / decrypt_credentials are the project's credential helpers (import path not shown here).
import base64
import json
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.fernet import Fernet

def test_credential_encryption_round_trip():
    master_key = b"test-master-key-32bytes-padded!!"  # 32 bytes
    user_id = "550e8400-e29b-41d4-a716-446655440000"
    credentials = {"access_token": "ya29.xxx", "refresh_token": "1//xxx", "expiry": "2026-05-28T15:00:00"}
    encrypted = encrypt_credentials(master_key, user_id, credentials)
    assert isinstance(encrypted, str)
    assert "access_token" not in encrypted  # not plaintext
    # Decrypt the ciphertext (not the original dict) to complete the round trip
    decrypted = decrypt_credentials(master_key, user_id, encrypted)
    assert decrypted == credentials
```
### SSRF Validation Test
```python
# Source: pattern derived from OWASP SSRF cheat sheet [CITED: cheatsheetseries.owasp.org]
import pytest

@pytest.mark.parametrize("url,should_raise", [
    ("http://localhost/dav", True),
    ("http://127.0.0.1/dav", True),
    ("http://169.254.169.254/dav", True),
    ("http://10.0.0.1/dav", True),
    ("http://192.168.1.1/dav", True),
    ("http://172.16.0.1/dav", True),
    ("https://nextcloud.example.com/remote.php/dav", False),
    ("http://[::1]/dav", True),  # IPv6 literals must be bracketed in URLs
])
def test_ssrf_validation(url, should_raise):
    if should_raise:
        with pytest.raises(ValueError):
            validate_cloud_url(url)
    else:
        validate_cloud_url(url)  # no exception
```
### CloudConnectionOut Whitelist Enforcement
```python
# Source: backend/api/admin.py (VERIFIED: codebase)
# The CloudConnectionOut model already exists in admin.py.
# ALL cloud connection endpoints must use this model, not CloudConnection ORM directly.
class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    model_config = {"from_attributes": True}

# Usage in cloud.py:
@router.get("/api/cloud/connections")
async def list_connections(
    current_user: User = Depends(get_regular_user),
    session: AsyncSession = Depends(get_db),
) -> dict:
    result = await session.execute(
        select(CloudConnection).where(CloudConnection.user_id == current_user.id)
    )
    connections = result.scalars().all()
    return {"items": [CloudConnectionOut.model_validate(c).model_dump() for c in connections]}
```
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Storing OAuth state in Flask/FastAPI session (in-memory) | Redis TTL-keyed state tokens | ~2022 with multi-instance deployments becoming standard | Multi-instance safety; prevents token fixation |
| webdav-client-python (original) | webdavclient3 (fork, actively maintained) | 2018 | webdav-client-python is unmaintained; webdavclient3 is the maintained fork |
| `google.oauth2.credentials.Credentials` with service accounts | `google-auth-oauthlib` Flow for user-delegated access | 2019 | Service accounts require GSuite domain; user OAuth is required for personal Drive |
| ADAL (Azure Active Directory Authentication Library) for Python | MSAL (Microsoft Authentication Library) | 2020; ADAL deprecated | ADAL end-of-life June 2023; MSAL is the replacement |
| Using `Fernet.generate_key()` with user passwords | HKDF + Fernet (key derivation before Fernet) | Ongoing best practice | Fernet keys must be 32 random bytes; `generate_key()` generates fresh random keys, not deterministic per-user keys |
**Deprecated/outdated:**
- `adal` Python package: End-of-life; replaced by `msal`. Do NOT use.
- `webdav-client-python` (without the `3`): Unmaintained since ~2018. Use `webdavclient3`.
- `google.oauth2.service_account.Credentials`: For service accounts, not user-delegated Drive access. Wrong tool for this use case.
---
## Assumptions Log
| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | `webdavclient3` uses `upload_to` / `download_from` method names for stream-based operations | Architecture Patterns Pattern 5 | Planner must verify method signatures against installed package; wrong method names cause `AttributeError` at test time |
| A2 | Google Drive `googleapiclient.errors.HttpError` status 401 is the token-expiry signal | Pattern 10: On-Demand Token Refresh | Actual exception class may differ; must verify during implementation with a real expired token |
| A3 | Microsoft Graph `invalid_grant` error appears in `result["error"]` from `msal.acquire_token_by_refresh_token` | Pattern 10 | MSAL may use a different error field or raise an exception; verify against msal docs |
| A4 | `webdavclient3` percent-encodes paths automatically | Pitfall 2 | May require manual encoding; verify during WebDAV backend implementation |
| A5 | `tenant_id="common"` works for both personal OneDrive and organizational accounts | Pattern 4: MSAL | May require `"consumers"` for personal accounts; verify against Microsoft docs for the target use case |
---
## Open Questions (RESOLVED)
1. **Google Drive object key scheme for `stat_object`**
- What we know: MinIO `stat_object` returns size in bytes from the storage layer. Google Drive returns file metadata including `size` from `files.get(fileId, fields='size')`.
- What's unclear: Google Drive may not return `size` for Google Workspace files (Docs, Sheets, Slides) since they have no binary size. DocuVault uploads binary files, so this may not be an issue in practice.
- Recommendation: Implement `stat_object` using `service.files().get(fileId=object_key, fields="size").execute()` and return `int(metadata["size"])`. Add a fallback of `0` for files without a size.
- **RESOLVED:** Use `service.files().get(fileId=object_key, fields="size").execute()` and return `int(metadata.get("size", 0))`. DocuVault only uploads binary files so the `0` fallback handles edge cases without breaking functionality.
2. **Nextcloud folder listing path convention**
- What we know: Nextcloud WebDAV base path is typically `/remote.php/dav/files/{username}/`.
- What's unclear: Whether the `webdavclient3` `Client` automatically handles the `/remote.php/dav/files/{username}/` prefix or whether it must be included in the `server_url`.
- Recommendation: Store `server_url` as the full WebDAV root (e.g., `https://nc.example.com/remote.php/dav/files/alice/`) and use relative paths within it. Test with PROPFIND on the root to validate the connection (D-08).
- **RESOLVED:** `server_url` stores the full WebDAV root including the `/remote.php/dav/files/{username}/` prefix. All relative paths within WebDAVBackend and NextcloudBackend are appended to this base. Connection validation uses a PROPFIND on the root path per D-08.
3. **Microsoft Graph upload for files > 4 MB**
- What we know: Simple upload (PUT `/me/drive/root:/{path}:/content`) is limited to 4 MB. Resumable sessions handle larger files.
- What's unclear: The Phase 5 plan should specify whether to implement resumable sessions upfront or use a 4 MB size gate.
- Recommendation: Implement resumable upload session (`createUploadSession`) for all files to avoid the hard limit. It handles both small and large files without a size check.
- **RESOLVED:** Implement `createUploadSession` for ALL file sizes (no size gate). `CHUNK_SIZE = 10 * 1024 * 1024` (10 MB, above Graph 4 MB limit) used in all OneDrive uploads. Pitfall 6 documented in Common Pitfalls section.
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python 3.12 (Docker) | All backends | In Docker container | 3.12.x | — |
| Redis | OAuth state storage | In Docker Compose | 6.x+ | — |
| PostgreSQL | cloud_connections table | In Docker Compose | 15.x | — |
| `cryptography` package | Credential encryption | NOT in requirements.txt | — | Must be added (48.0.0 verified) |
| `google-auth-oauthlib` | Google Drive OAuth | NOT in requirements.txt | — | Must be added (1.3.1 verified) |
| `google-api-python-client` | Google Drive API | NOT in requirements.txt | — | Must be added (2.196.0 verified) |
| `msal` | OneDrive OAuth | NOT in requirements.txt | — | Must be added (1.36.0 verified) |
| `webdavclient3` | WebDAV/Nextcloud | NOT in requirements.txt | — | Must be added (3.14.7 verified) |
| `cachetools` | Folder listing cache | NOT in requirements.txt | — | Must be added (6.2.6 verified) |
| Google OAuth App (GCP console) | Google Drive integration | NOT CONFIGURED | — | Must be created by user; client_id/client_secret added to .env |
| Microsoft App Registration (Azure portal) | OneDrive integration | NOT CONFIGURED | — | Must be created by user; client_id/client_secret/tenant_id added to .env |
**Missing dependencies with no fallback:**
- `cryptography`, `google-auth-oauthlib`, `google-api-python-client`, `msal`, `webdavclient3`, `cachetools` — must be added to `requirements.txt` before any cloud backend code runs.
**Missing dependencies with fallback (soft):**
- Google OAuth App credentials: Integration tests for Google Drive will need mocked OAuth flows if real GCP app is not configured. Unit tests can mock the entire OAuth flow.
- Microsoft App Registration: Same as above for OneDrive.
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already in requirements.txt) |
| Config file | `backend/pytest.ini` (already exists) |
| Quick run command | `cd backend && pytest tests/test_cloud.py -x -v` |
| Full suite command | `cd backend && pytest -v` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| CLOUD-01 | User can connect all 4 providers | Integration | `pytest tests/test_cloud.py::test_connect_google_drive -x` | ❌ Wave 0 |
| CLOUD-01 | OAuth callback validates state and saves connection | Integration | `pytest tests/test_cloud.py::test_oauth_callback_valid_state -x` | ❌ Wave 0 |
| CLOUD-01 | Invalid OAuth state returns 400 | Integration | `pytest tests/test_cloud.py::test_oauth_callback_invalid_state -x` | ❌ Wave 0 |
| CLOUD-01 | WebDAV/Nextcloud connection validated before save (D-08) | Integration | `pytest tests/test_cloud.py::test_webdav_connect_validates -x` | ❌ Wave 0 |
| CLOUD-02 | Credential encryption/decryption round-trip | Unit | `pytest tests/test_cloud.py::test_credential_round_trip -x` | ❌ Wave 0 |
| CLOUD-02 | `credentials_enc` not in any API response (SEC-08) | Integration | `pytest tests/test_cloud.py::test_credentials_enc_not_exposed -x` | ❌ Wave 0 |
| CLOUD-03 | Upload to cloud folder goes through FastAPI (not presigned URL) | Integration | `pytest tests/test_cloud.py::test_cloud_upload_no_presigned -x` | ❌ Wave 0 |
| CLOUD-04 | Connection status displayed correctly | Integration | `pytest tests/test_cloud.py::test_connection_status_display -x` | ❌ Wave 0 |
| CLOUD-05 | `invalid_grant` → `REQUIRES_REAUTH` transition | Integration | `pytest tests/test_cloud.py::test_invalid_grant_sets_requires_reauth -x` | ❌ Wave 0 |
| CLOUD-06 | Disconnect permanently deletes credentials | Integration | `pytest tests/test_cloud.py::test_disconnect_deletes_credentials -x` | ❌ Wave 0 |
| CLOUD-07 | StorageBackend factory returns correct type | Unit | `pytest tests/test_cloud.py::test_factory_returns_correct_backend -x` | ❌ Wave 0 |
| D-17 | SSRF validation blocks RFC-1918 and loopback | Unit | `pytest tests/test_cloud.py::test_ssrf_validation -x` | ❌ Wave 0 |
| D-17 | SSRF validation blocks 169.254.x link-local | Unit | `pytest tests/test_cloud.py::test_ssrf_link_local -x` | ❌ Wave 0 |
| SEC | Admin cannot access cloud connection credentials | Integration | `pytest tests/test_cloud.py::test_admin_cannot_see_credentials -x` | ❌ Wave 0 |
| SEC | Cross-user cloud connection access returns 404 | Integration | `pytest tests/test_cloud.py::test_cross_user_idor -x` | ❌ Wave 0 |
### Sampling Rate
- **Per task commit:** `cd backend && pytest tests/test_cloud.py -x -v`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `backend/tests/test_cloud.py` — all Phase 5 tests (unit + integration), starting with xfail stubs
- [ ] New conftest fixtures: `mock_google_drive_creds`, `mock_onedrive_creds`, `mock_webdav_client`, `cloud_connection_factory`
---
## Security Domain
### Applicable ASVS Categories
| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | yes | OAuth2 state CSRF; per-session token; `get_regular_user` dep on all cloud endpoints |
| V3 Session Management | yes | OAuth state token is single-use; stored in Redis with TTL; deleted after callback |
| V4 Access Control | yes | Every `/api/cloud/*` endpoint asserts `connection.user_id == current_user.id` before operations |
| V5 Input Validation | yes | `validate_cloud_url()` for WebDAV/Nextcloud; Pydantic models for all request bodies; no raw string interpolation in URLs |
| V6 Cryptography | yes | HKDF + Fernet for credential encryption; Fernet provides AES-128-CBC with HMAC-SHA256 via the `cryptography` library (never hand-rolled) |
| V7 Error Handling | yes | `invalid_grant` handled explicitly (D-06); no stack traces in cloud API error responses |
### Known Threat Patterns for OAuth + Cloud Storage
| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| CSRF on OAuth callback | Tampering | `state` parameter validated via Redis; state token is `secrets.token_urlsafe(32)` |
| SSRF via WebDAV/Nextcloud URL | Tampering / Information Disclosure | `validate_cloud_url()` at connect-time and before each request; `ipaddress` module DNS resolution check |
| Credential exposure via API leak | Information Disclosure | `CloudConnectionOut` Pydantic whitelist; `credentials_enc` excluded by omission |
| Token replay via OAuth state | Elevation of Privilege | Redis single-use deletion after callback; 30-minute TTL prevents stale states |
| Cross-user cloud connection access (IDOR) | Information Disclosure / Elevation of Privilege | `connection.user_id == current_user.id` assertion on every operation; 404 not 403 |
| Unverified credentials stored (D-08) | Information Disclosure / DoS | PROPFIND/OPTIONS validation before storage; error returned on failure |
| Refresh token theft from DB | Information Disclosure | `credentials_enc` is Fernet-encrypted with HKDF per-user key; master key in env var only |
| Admin accessing user cloud credentials | Information Disclosure / Elevation of Privilege | `get_regular_user` dep blocks admin (403); `CloudConnectionOut` whitelist on all responses |
| DNS rebinding SSRF bypass | Tampering | `validate_cloud_url()` called immediately before each outbound request (not only at connect-time); documented defense-in-depth via network egress firewall |
---
## Project Constraints (from CLAUDE.md)
The following CLAUDE.md directives are binding for Phase 5:
- JWT access token lives in Pinia memory only — never localStorage or sessionStorage (OAuth callback must redirect to Vue with a query param, not embed tokens in the URL)
- Cloud credentials encrypted with HKDF per-user key derivation — master key in env var only
- Admin endpoints never return `credentials_enc`
- Every cloud connection endpoint asserts `resource.user_id == current_user.id`
- All DB queries via ORM / parameterized statements — zero raw string interpolation
- `get_regular_user` on all cloud connection endpoints (admin blocked from this surface)
- `write_audit_log()` called on cloud connect, disconnect, and re-auth events
- Testing protocol: every new function, endpoint, and component must have at least one test; `pytest -v` must pass zero failures
- Security gate: `bandit -r backend/`, `pip-audit`, `npm audit --audit-level=high` must all pass before phase advancement
- Bug fix rule: root cause only, ≤50 lines, regression test required
---
## Sources
### Primary (HIGH confidence)
- `backend/storage/base.py` — StorageBackend ABC, 7 abstract methods, exact signatures
- `backend/storage/minio_backend.py` — asyncio.to_thread() wrapping pattern, error handling shape
- `backend/storage/__init__.py` — factory pattern to extend
- `backend/db/models.py` — CloudConnection model fields, Document.storage_backend, User.default_storage_backend
- `backend/api/admin.py` — CloudConnectionOut Pydantic whitelist pattern (already exists)
- `backend/main.py` — Redis wiring on app.state.redis, lifespan pattern
- `backend/deps/auth.py` — get_regular_user, get_current_user patterns
- `backend/migrations/versions/0001_initial_schema.py` — confirmed cloud_connections table, storage_backend columns
- [cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/](https://cryptography.io/en/latest/hazmat/primitives/key-derivation-functions/) — HKDF usage and info parameter
- [cryptography.io/en/latest/fernet/](https://cryptography.io/en/latest/fernet/) — Fernet key format
- [googleapis.dev/python/google-auth-oauthlib/latest](https://googleapis.dev/python/google-auth-oauthlib/latest/reference/google_auth_oauthlib.flow.html) — Flow class API
- PyPI `pip download` — confirmed versions: cryptography-48.0.0, google_auth_oauthlib-1.3.1, google_api_python_client-2.196.0, msal-1.36.0, webdavclient3-3.14.7, cachetools-6.2.6
- slopcheck 0.6.1 — all 7 packages rated [OK]
### Secondary (MEDIUM confidence)
- [learn.microsoft.com/en-us/entra/msal/python/](https://learn.microsoft.com/en-us/entra/msal/python/) — MSAL Python overview and authorization code flow
- [cachetools.readthedocs.io](https://cachetools.readthedocs.io/en/stable/) — TTLCache thread safety requirement
- [cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html](https://cheatsheetseries.owasp.org/cheatsheets/Server_Side_Request_Forgery_Prevention_Cheat_Sheet.html) — DNS resolution-based SSRF check
### Tertiary (LOW confidence / ASSUMED)
- webdavclient3 specific method names (`upload_to`, `download_from`) — marked [ASSUMED] above; verify during implementation
- Exact Microsoft Graph error field for `invalid_grant` in MSAL — marked [ASSUMED] above
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all packages verified on PyPI, slopcheck clean, versions confirmed
- Architecture: HIGH — built directly from codebase inspection; ABC, factory, CloudConnection model, Redis wiring all verified
- OAuth2 flows: MEDIUM/HIGH — google-auth-oauthlib Flow API verified via official docs; MSAL pattern confirmed via Microsoft docs
- Pitfalls: HIGH — based on official library docs and known OAuth edge cases
- SSRF prevention: HIGH — Python stdlib ipaddress module; OWASP-cited approach
**Research date:** 2026-05-28
**Valid until:** 2026-06-28 (30 days) — package versions are stable but verify before pinning in requirements.txt