---
phase: 05-cloud-storage-backends
plan: 02
subsystem: api
tags: [cryptography, hkdf, fernet, ssrf, ipaddress, cachetools, ttlcache, cloud-storage, storage-factory]

# Dependency graph
requires:
  - phase: 05-cloud-storage-backends
    plan: 01
    provides: "Wave 0 xfail stubs in test_cloud.py, cloud Settings fields (cloud_creds_key), cachetools pin in requirements.txt"
provides:
  - "backend/storage/cloud_utils.py: validate_cloud_url (SSRF), encrypt_credentials, decrypt_credentials, _derive_fernet_key (HKDF)"
  - "backend/services/cloud_cache.py: TTLCache(maxsize=1000,ttl=60) singleton, get_cloud_folders_cached (async), invalidate_provider_cache"
  - "backend/storage/__init__.py: get_storage_backend_for_document() async factory"
  - "backend/tests/test_cloud_utils.py: 27 green tests covering SSRF, HKDF round-trip, cache, factory"
affects: [05-03, 05-04, 05-05, 05-06, 05-07, 05-08]

# Tech tracking
tech-stack:
  added:
    - cryptography (HKDF-SHA256; Fernet, i.e. AES-128-CBC with HMAC-SHA256), installed locally for testing
    - cachetools (TTLCache), installed locally for testing
  patterns:
    - "Fresh HKDF instance per call (AlreadyFinalized pitfall avoidance, RESEARCH.md Pitfall 3)"
    - "DNS-resolved SSRF check: socket.getaddrinfo before ipaddress.ip_network membership test"
    - "Explicit localhost string block before DNS resolution (OS-agnostic edge case)"
    - "Fetch-outside-lock async cache pattern: acquire lock to check, release, await fetch_fn, acquire lock to write"
    - "Lazy import inside get_storage_backend_for_document to avoid circular imports at module load time"
key-files:
  created:
    - backend/storage/cloud_utils.py
    - backend/services/cloud_cache.py
    - backend/tests/test_cloud_utils.py
  modified:
    - backend/storage/__init__.py
key-decisions:
  - "Explicit 'localhost' string block added before DNS resolution: Python 3.9 getaddrinfo resolves localhost to 127.0.0.1 on macOS but behaviour varies by OS; string check is O(1) and OS-agnostic"
  - "validate_cloud_url test uses 8.8.8.8 (raw public IP) instead of
    cloud.example.com, because example.com does not resolve in offline CI environments"
  - "type: ignore[import] on lazy cloud backend imports: modules do not exist yet (Plans 05-03..05-05 create them)"
  - "IPv4/IPv6 family mismatch in ip_network check handled via try/except TypeError to avoid cross-family errors"
patterns-established:
  - "HKDF key derivation: fresh HKDF(...) instance inside _derive_fernet_key() every call, never cached"
  - "SSRF validation: scheme check → hostname presence → localhost string → raw IP parse → DNS resolve → blocked network membership"
  - "Cloud factory extension: get_storage_backend_for_document() alongside (not replacing) get_storage_backend()"
  - "TTLCache thread safety: threading.Lock wraps all _folder_cache reads/writes; fetch_fn awaited outside lock"
requirements-completed:
  - CLOUD-02
  - CLOUD-07

# Metrics
duration: 18min
completed: 2026-05-28
---

# Phase 5 Plan 02: Shared Cloud Utilities Layer Summary

**SSRF-safe URL validator (RFC-1918/loopback/link-local/localhost/IPv6 blocked via DNS resolution), HKDF-SHA256+Fernet credential encryption with per-user key derivation, TTLCache(1000, 60s) folder listing cache, and async storage backend factory for per-document backend dispatch**

## Performance

- **Duration:** 18 min
- **Started:** 2026-05-28T19:10:00Z
- **Completed:** 2026-05-28T19:28:00Z
- **Tasks:** 2
- **Files modified:** 4

## Accomplishments

- Created `cloud_utils.py` with `validate_cloud_url()` (DNS-resolved SSRF prevention blocking RFC-1918, loopback, link-local, localhost, and IPv6 private ranges), `_derive_fernet_key()` (fresh HKDF instance per call to avoid AlreadyFinalized), `encrypt_credentials()` and `decrypt_credentials()` (Fernet round-trip over JSON-serialised dict)
- Created `cloud_cache.py` with module-level `TTLCache(maxsize=1000, ttl=60)` singleton, thread-safe lock, `get_cloud_folders_cached()` async function (fetch-outside-lock pattern), and `invalidate_provider_cache()` sync helper
- Extended `storage/__init__.py` with
  `get_storage_backend_for_document()` async factory: returns MinIOBackend for minio docs, queries CloudConnection scoped to user.id, decrypts credentials, lazy-imports cloud backend classes to avoid circular imports; raises HTTPException(503) if connection missing or inactive
- Created `tests/test_cloud_utils.py` with 27 green tests using TDD (RED → GREEN), covering all SSRF cases, HKDF round-trip invariants, TTLCache configuration, async cache behaviour, and factory importability

## Task Commits

1. **RED phase tests** - `7fdffdd` (test)
2. **Task 1: cloud_utils.py, SSRF validation and HKDF credential encryption** - `976d2ca` (feat)
3. **Task 2: cloud_cache.py and storage factory extension** - `fb80379` (feat)

## Files Created/Modified

- `/Users/nik/Documents/Progamming/document_scanner/backend/storage/cloud_utils.py` - validate_cloud_url, _derive_fernet_key, encrypt_credentials, decrypt_credentials
- `/Users/nik/Documents/Progamming/document_scanner/backend/services/cloud_cache.py` - TTLCache singleton, get_cloud_folders_cached, invalidate_provider_cache
- `/Users/nik/Documents/Progamming/document_scanner/backend/storage/__init__.py` - Added get_storage_backend_for_document() async factory alongside existing get_storage_backend()
- `/Users/nik/Documents/Progamming/document_scanner/backend/tests/test_cloud_utils.py` - 27 green tests (TDD)

## Decisions Made

- Explicit `hostname == "localhost"` string block is added BEFORE DNS resolution. Python's `getaddrinfo("localhost", None)` behaviour varies by OS (macOS resolves to `::1` or `127.0.0.1`; Docker containers sometimes fail), so the string check is more reliable and O(1).
- `test_allows_public_https` was written to use `8.8.8.8` (a raw public IP) instead of `cloud.example.com`. The `cloud.example.com` domain does not resolve in offline/sandbox CI environments, causing a spurious test failure unrelated to the SSRF logic being tested.
- `# type: ignore[import]` comments were added to the lazy imports inside `get_storage_backend_for_document()` because the cloud backend modules (`google_drive_backend.py`, `onedrive_backend.py`, `webdav_backend.py`) do not exist yet; they are created by Plans 05-03 through 05-05.
- IPv4/IPv6 family mismatch in `addr in net` is caught via `except TypeError: continue` rather than pre-filtering networks. This is simpler and avoids maintaining two separate network lists.

## Deviations from Plan

### Auto-fixed Issues

**1. [Rule 1 - Bug] Fixed SSRF allow test using unresolvable domain**

- **Found during:** Task 1 test execution (GREEN phase)
- **Issue:** `test_allows_public_https` used `cloud.example.com`, which does not resolve in the local (offline) test environment, causing a spurious ValueError from `socket.gaierror` rather than a real SSRF failure
- **Fix:** Replaced with `https://8.8.8.8/remote.php/dav` (raw public IP, no DNS required)
- **Files modified:** `backend/tests/test_cloud_utils.py`
- **Verification:** The test now passes; the implementation was already correct and is unchanged
- **Committed in:** `976d2ca` (part of Task 1 commit)

---

**Total deviations:** 1 auto-fixed (Rule 1 - test bug in network-isolated environment)
**Impact on plan:** No scope creep. Fix was a test correctness issue, not an implementation change.

## Issues Encountered

`cryptography` and `cachetools` were not installed in the local Python 3.9.6 environment (they were added to `requirements.txt` in Plan 05-01 but not installed locally). Installed both via `pip3 install cryptography cachetools` to enable local test execution. This is consistent with the Plan 05-01 SUMMARY note about running tests locally vs. inside Docker.

## Known Stubs

None introduced by this plan. The `# type: ignore[import]` comments on the lazy cloud backend imports in `storage/__init__.py` are expected: those modules are created by Plans 05-03 through 05-05, and the comments can be removed as those plans complete.
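
## Illustrative Sketch: SSRF Check Order

The check order recorded in this summary (scheme check, hostname presence, localhost string, raw IP parse, DNS resolve, blocked-network membership) could look roughly like the sketch below. It is not the code in `cloud_utils.py`: the blocked-network list, the https-only rule, the error messages, and the choice to return the URL unchanged are all assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Assumed block list; the real list in cloud_utils.py is not shown in this summary.
_BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918 private ranges
        "127.0.0.0/8", "169.254.0.0/16",                   # IPv4 loopback, link-local
        "::1/128", "fc00::/7", "fe80::/10",                # IPv6 loopback, ULA, link-local
    )
]


def _is_blocked(addr) -> bool:
    """Return True if addr falls inside any blocked network."""
    for net in _BLOCKED_NETWORKS:
        try:
            if addr in net:
                return True
        except TypeError:
            # Defensive guard for IPv4/IPv6 family mismatches, mirroring the
            # decision recorded in this summary.
            continue
    return False


def validate_cloud_url(url: str) -> str:
    """Raise ValueError for URLs that point at internal hosts; else return url."""
    parsed = urlparse(url)
    if parsed.scheme != "https":  # assumption: only https is accepted
        raise ValueError("only https URLs are accepted")
    host = parsed.hostname  # lower-cased, IPv6 brackets stripped
    if not host:
        raise ValueError("URL has no hostname")
    if host == "localhost":  # O(1) string check before any DNS lookup
        raise ValueError("localhost is not allowed")
    try:
        # A raw IP literal needs no DNS lookup.
        candidates = [ipaddress.ip_address(host)]
    except ValueError:
        try:
            infos = socket.getaddrinfo(host, None)
        except socket.gaierror as exc:
            raise ValueError(f"cannot resolve host {host!r}") from exc
        # sockaddr[0] is the address string; drop any IPv6 scope id ("%eth0").
        candidates = [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]
    if any(_is_blocked(addr) for addr in candidates):
        raise ValueError(f"host {host!r} resolves to a blocked address")
    return url
```

Under these assumptions, `validate_cloud_url("https://8.8.8.8/remote.php/dav")` returns the URL without touching DNS, which is why a raw public IP works in an offline test run, while `https://localhost/...`, `https://192.168.1.10/...`, and `https://[::1]/...` raise `ValueError`.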
## Threat Surface Scan

No new network endpoints introduced. All security surfaces are internal utilities:

- `validate_cloud_url()` is a validation function; it performs DNS lookups via `socket.getaddrinfo` but makes no HTTP requests
- `encrypt_credentials()` / `decrypt_credentials()` are pure crypto functions
- `get_storage_backend_for_document()` is a factory (no new HTTP endpoints)

No threat flags raised.

## Next Phase Readiness

- All shared utilities are in place. Plans 05-03 through 05-05 can import from `storage.cloud_utils` and `services.cloud_cache` immediately.
- `get_storage_backend_for_document()` works for minio documents now; cloud backends are activated as each backend plan completes.
- The 27 new tests in `test_cloud_utils.py` are green; the 19 xfail stubs in `test_cloud.py` remain xfail (correctly, since they test API endpoints not yet built).
- Full suite: 199 passed / 43 xfailed / 1 pre-existing failure (`test_extract_docx`: python-docx not installed locally, documented in Plan 05-01).

## Self-Check: PASSED

Files verified present:

- `backend/storage/cloud_utils.py`: FOUND
- `backend/services/cloud_cache.py`: FOUND
- `backend/storage/__init__.py`: FOUND (contains get_storage_backend_for_document)
- `backend/tests/test_cloud_utils.py`: FOUND (27 tests, all passing)

Commits verified:

- 7fdffdd: test(05-02): add failing RED tests - FOUND
- 976d2ca: feat(05-02): implement cloud_utils.py - FOUND
- fb80379: feat(05-02): implement cloud_cache.py and extend storage factory - FOUND

---

*Phase: 05-cloud-storage-backends*
*Completed: 2026-05-28*
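
## Illustrative Sketch: Fetch-Outside-Lock Cache

The fetch-outside-lock pattern named in this summary (acquire the lock to check, release it, await `fetch_fn`, acquire the lock again to write) might be sketched as follows. To keep the example runnable with the standard library alone, a plain dict with expiry timestamps stands in for `cachetools.TTLCache(maxsize=1000, ttl=60)`. The function signatures, the cache key shape `(user_id, provider, parent_id)`, and the eviction rule are assumptions, not the contents of `cloud_cache.py`.

```python
import asyncio
import threading
import time
from typing import Any, Awaitable, Callable, Dict, List, Tuple

_TTL_SECONDS = 60.0   # mirrors ttl=60 from the summary
_MAX_ENTRIES = 1000   # mirrors maxsize=1000 from the summary

CacheKey = Tuple[str, str, str]  # hypothetical key: (user_id, provider, parent_id)

# Stand-in for the TTLCache singleton: key -> (expiry time, cached folders).
_folder_cache: Dict[CacheKey, Tuple[float, List[Any]]] = {}
_cache_lock = threading.Lock()


async def get_cloud_folders_cached(
    user_id: str,
    provider: str,
    parent_id: str,
    fetch_fn: Callable[[], Awaitable[List[Any]]],
) -> List[Any]:
    key = (user_id, provider, parent_id)

    # Step 1: hold the lock only long enough to look up the entry.
    with _cache_lock:
        entry = _folder_cache.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]

    # Step 2: the slow provider call runs with the lock released, so other
    # callers are not blocked while this one waits on the network.
    folders = await fetch_fn()

    # Step 3: re-acquire the lock to store the result.
    with _cache_lock:
        if key not in _folder_cache and len(_folder_cache) >= _MAX_ENTRIES:
            # Crude eviction for the stand-in: drop the entry nearest expiry.
            soonest = min(_folder_cache, key=lambda k: _folder_cache[k][0])
            del _folder_cache[soonest]
        _folder_cache[key] = (time.monotonic() + _TTL_SECONDS, folders)
    return folders


def invalidate_provider_cache(user_id: str, provider: str) -> None:
    """Drop every cached listing for one user's provider connection."""
    with _cache_lock:
        stale = [k for k in _folder_cache if k[:2] == (user_id, provider)]
        for k in stale:
            del _folder_cache[k]
```

One consequence of releasing the lock before the fetch is that two callers that miss at the same time may both call `fetch_fn`, and the later write wins. For a short-lived folder listing that trade-off is usually acceptable, because it never holds a `threading.Lock` across an `await`.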