docs(05-03): complete GoogleDriveBackend + OneDriveBackend plan

- SUMMARY.md created for Plan 05-03
- STATE.md updated: completed_plans 26→27, progress 81→84%
- Session continuity updated with pytest results (262 passed / 43 xfailed / 1 pre-existing)
- Key decisions added: shared CloudConnectionError, cache_discovery=False, createUploadSession
curo1305
2026-05-28 21:13:53 +02:00
parent a9ea33dd18
commit 6834a6797f
2 changed files with 162 additions and 9 deletions
@@ -0,0 +1,148 @@
---
phase: 05-cloud-storage-backends
plan: 03
subsystem: api
tags: [google-drive, onedrive, microsoft-graph, msal, google-api-python-client, oauth2, asyncio, cloud-storage]
# Dependency graph
requires:
- phase: 05-cloud-storage-backends
plan: 02
provides: "CloudConnectionError (shared exception), StorageBackend ABC, asyncio.to_thread pattern reference (MinIOBackend)"
provides:
- "backend/storage/google_drive_backend.py: GoogleDriveBackend + CloudConnectionError exception class"
- "backend/storage/onedrive_backend.py: OneDriveBackend with resumable upload and MSAL token refresh"
- "backend/tests/test_cloud_backends.py: 32 green TDD tests for both backends"
affects: [05-05, 05-06, 05-07, 05-08]
# Tech tracking
tech-stack:
added:
- google-api-python-client 2.196.0 (Google Drive v3 API — files.create, get_media, delete, list)
- google-auth-oauthlib 1.3.1 (google.oauth2.credentials.Credentials)
- msal 1.36.0 (ConfidentialClientApplication.acquire_token_by_refresh_token)
patterns:
- "Shared exception class: CloudConnectionError(reason=) defined once in google_drive_backend.py, imported by onedrive_backend.py"
- "All sync SDK calls wrapped in asyncio.to_thread() — identical pattern to MinIOBackend"
- "cache_discovery=False on googleapiclient.discovery.build() — prevents /tmp discovery doc writes"
- "B2 design: backends are stateless signal-raisers — raise CloudConnectionError, never update DB"
- "OneDrive resumable upload: createUploadSession for ALL files (no 4 MB size gate)"
- "CHUNK_SIZE = 10 MB — above Graph's 4 MB simple upload limit (Pitfall 6 prevention)"
key-files:
created:
- backend/storage/google_drive_backend.py
- backend/storage/onedrive_backend.py
- backend/tests/test_cloud_backends.py
modified: []
key-decisions:
- "CloudConnectionError defined in google_drive_backend.py and imported by onedrive_backend.py — single shared exception type keeps error handling uniform in the API layer (cloud.py, Plan 05-05)"
- "cache_discovery=False on Drive build() — prevents googleapiclient from writing /tmp discovery cache, avoiding /tmp traversal vector (T-05-03-05)"
- "Resumable upload sessions used for ALL OneDrive uploads regardless of file size — simpler than a size gate and eliminates the 4 MB limit (Pitfall 6, RESEARCH.md Open Question 3)"
- "MSAL invalid_grant detection via result.get('error') == 'invalid_grant' — confirmed as the correct Assumption A3 from RESEARCH.md"
- "_ensure_valid_token() uses 60-second buffer before expiry — reduces race conditions between expiry check and actual API call"
patterns-established:
- "Backend statelessness: cloud backends raise CloudConnectionError(reason=) and never call session.commit()"
- "Google Drive 401 → token_expired; 400 + invalid_grant body → invalid_grant"
- "OneDrive: _ensure_valid_token() + _refresh_token() called before every operation"
requirements-completed:
- CLOUD-01
- CLOUD-05
- CLOUD-07
# Metrics
duration: 6min
completed: 2026-05-28
---
# Phase 5 Plan 03: Google Drive and OneDrive StorageBackend Implementations Summary
**Stateless GoogleDriveBackend (Drive v3 with asyncio.to_thread, cache_discovery=False) and OneDriveBackend (MSAL token refresh, 10 MB resumable upload sessions via createUploadSession) implementing all 7 StorageBackend methods**
## Performance
- **Duration:** 6 min
- **Started:** 2026-05-28T19:05:18Z
- **Completed:** 2026-05-28T19:11:00Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- Created `google_drive_backend.py` with `CloudConnectionError(reason=)` exception class and `GoogleDriveBackend` implementing all 7 StorageBackend methods. Every sync `googleapiclient` call is wrapped in `asyncio.to_thread()`. `cache_discovery=False` prevents /tmp traversal (T-05-03-05). HttpError 401 raises `CloudConnectionError(reason="token_expired")`; HttpError 400 with "invalid_grant" body raises `CloudConnectionError(reason="invalid_grant")`. `presigned_get_url` and `generate_presigned_put_url` raise `NotImplementedError` (D-14).
- Created `onedrive_backend.py` with `OneDriveBackend` importing the shared `CloudConnectionError` from `google_drive_backend`. `CHUNK_SIZE = 10 * 1024 * 1024` (10 MB). Uses Microsoft Graph `createUploadSession` for all uploads (no 4 MB size gate). `_ensure_valid_token()` checks expiry with 60s buffer; `_refresh_token()` wraps MSAL in `asyncio.to_thread()` and returns `None` on `invalid_grant` to trigger `CloudConnectionError(reason="invalid_grant")`. Both `presigned_*` methods raise `NotImplementedError`.
- Created `tests/test_cloud_backends.py` with 32 TDD tests (RED → GREEN) covering imports, all 7 methods being async, `CHUNK_SIZE`, shared `CloudConnectionError`, `presigned_*` raising `NotImplementedError`, `__init__` correctness, and `_ensure_valid_token` behavior for expired/non-expired tokens.
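
The stateless signal-raiser pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `google_drive_backend.py`: `FakeHttpError` is a hypothetical stand-in for the SDK's HTTP error so the example runs without `googleapiclient` installed, and `to_cloud_error` and `call_sdk` are illustrative names.

```python
import asyncio


class CloudConnectionError(Exception):
    """Signals that a cloud connection needs attention; carries a reason code."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


class FakeHttpError(Exception):
    """Hypothetical stand-in for an SDK HTTP error (status code + response body)."""

    def __init__(self, status: int, body: str = "") -> None:
        super().__init__(f"HTTP {status}")
        self.status = status
        self.body = body


def to_cloud_error(exc: FakeHttpError):
    """Map 401 to token_expired and 400 + invalid_grant body to invalid_grant."""
    if exc.status == 401:
        return CloudConnectionError(reason="token_expired")
    if exc.status == 400 and "invalid_grant" in exc.body:
        return CloudConnectionError(reason="invalid_grant")
    return None  # not an auth failure; caller re-raises the original error


async def call_sdk(blocking_fn, *args):
    """Run a blocking SDK call in a worker thread and translate auth failures.

    The function only raises; it never touches a database session.
    """
    try:
        return await asyncio.to_thread(blocking_fn, *args)
    except FakeHttpError as exc:
        mapped = to_cloud_error(exc)
        if mapped is None:
            raise
        raise mapped from exc
```

Because the backend only raises, the caller decides what to do with `reason` (for example, marking a connection as needing re-authorization).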
## Task Commits
Each task was committed atomically following the TDD RED → GREEN cycle:
1. **RED phase tests — both backends** - `4efe7c1` (test)
2. **Task 1: GoogleDriveBackend** - `337ee8e` (feat)
3. **Task 2: OneDriveBackend** - `bcb887e` (feat)
## Files Created/Modified
- `/Users/nik/Documents/Progamming/document_scanner/backend/storage/google_drive_backend.py` — GoogleDriveBackend (all 7 methods) + CloudConnectionError exception class
- `/Users/nik/Documents/Progamming/document_scanner/backend/storage/onedrive_backend.py` — OneDriveBackend (all 7 methods), CHUNK_SIZE, MSAL token refresh, resumable upload
- `/Users/nik/Documents/Progamming/document_scanner/backend/tests/test_cloud_backends.py` — 32 green TDD tests for both backends
## Decisions Made
- `CloudConnectionError` is defined once in `google_drive_backend.py` and imported by `onedrive_backend.py`. This keeps the exception type unified — the API layer in `cloud.py` (Plan 05-05) will catch one exception type regardless of which backend raised it.
- `cache_discovery=False` is explicitly set on `googleapiclient.discovery.build()`. Without this flag, the client writes a JSON discovery document to `/tmp` on first call — this was identified as Threat T-05-03-05 in the plan's threat model.
- `createUploadSession` is used for ALL OneDrive uploads (not only files > 4 MB). This matches RESEARCH.md's resolution of Open Question 3: simpler code (no size branch), avoids the 4 MB limit entirely, and handles both small and large files through the same path.
- MSAL's `invalid_grant` is detected via `result.get("error") == "invalid_grant"`, consistent with Assumption A3 in RESEARCH.md. For OAuth failures MSAL returns a dict with an `error` key rather than raising, so field-level checking is the correct approach.
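
To illustrate the resumable-upload decision above, here is a minimal sketch of how a file might be split into `Content-Range` chunks for an upload session. The helper name `content_ranges` is hypothetical and is not taken from `onedrive_backend.py`; only `CHUNK_SIZE` and the single code path for small and large files come from this summary.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MiB; Graph expects fragment sizes in multiples of 320 KiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE):
    """Yield (start, end_inclusive, header_value) for each chunk of an upload.

    A file smaller than chunk_size yields exactly one range, so no size
    branch is needed. A zero-byte file yields nothing; how to handle that
    case is left to the caller in this sketch.
    """
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1
```

For example, `list(content_ranges(25, chunk_size=10))` produces the header values `bytes 0-9/25`, `bytes 10-19/25`, and `bytes 20-24/25`; each would accompany one `PUT` of the corresponding byte slice to the session's upload URL.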
## Deviations from Plan
None — plan executed exactly as written. Both backends implemented per the action specifications, all acceptance criteria met.
## Issues Encountered
`google-api-python-client`, `google-auth-oauthlib`, and `msal` were not installed in the local Python 3.9.6 environment (they were added to `requirements.txt` in Plan 05-01 but not installed locally). Installed all three via `pip3 install` to enable local test execution. This is consistent with the Plan 05-02 SUMMARY's note about running tests locally vs. Docker.
FutureWarnings from `google.auth` about Python 3.9 end-of-life appeared in pytest output but do not affect test results — they are informational warnings from the library, not from our code.
## Known Stubs
None. Both backends are fully implemented with real method bodies. No placeholder returns or TODO comments in production code paths.
## Threat Surface Scan
No new network endpoints introduced. Both backends are pure library classes:
- `GoogleDriveBackend` makes outbound calls to `googleapis.com` using OAuth tokens from the decrypted credentials dict. Credentials are not logged.
- `OneDriveBackend` makes outbound calls to `graph.microsoft.com` and `login.microsoftonline.com` (via MSAL). Credentials are not logged.
No trust boundaries were introduced beyond those already documented in the plan's `<threat_model>`. All STRIDE mitigations listed are implemented:
- T-05-03-01: Credentials dict never logged; only in memory during request lifecycle
- T-05-03-02: `invalid_grant` detection implemented; `CloudConnectionError(reason="invalid_grant")` propagated to API layer
- T-05-03-05: `cache_discovery=False` implemented on Drive `build()` call
No threat flags raised.
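
The `invalid_grant` mitigation (T-05-03-02) and the 60-second expiry buffer can be sketched as two small pure functions. This is an assumption-laden illustration: `token_needs_refresh` and `access_token_or_none` are hypothetical names, and raising `RuntimeError` for other MSAL errors is a choice made for the example, not documented behavior of `onedrive_backend.py`.

```python
import time
from typing import Optional

EXPIRY_BUFFER_SECONDS = 60  # refresh slightly early to avoid expiring mid-request


def token_needs_refresh(expires_at: float, now: Optional[float] = None) -> bool:
    """Return True when the token has expired or will expire within the buffer."""
    current = time.time() if now is None else now
    return current >= expires_at - EXPIRY_BUFFER_SECONDS


def access_token_or_none(result: dict) -> Optional[str]:
    """Interpret an MSAL-style result dict.

    Returns None on invalid_grant so the caller can raise
    CloudConnectionError(reason="invalid_grant").
    """
    if result.get("error") == "invalid_grant":
        return None
    if "access_token" not in result:
        # Assumption for this sketch: surface any other failure loudly.
        raise RuntimeError(result.get("error_description", "token refresh failed"))
    return result["access_token"]
```

Keeping both checks free of I/O makes them straightforward to unit-test with fixed timestamps and hand-built result dicts.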
## Next Phase Readiness
- Both OAuth cloud backends are complete and importable. Plan 05-05 (`cloud.py` API layer) can import `GoogleDriveBackend`, `OneDriveBackend`, and `CloudConnectionError` directly.
- The `get_storage_backend_for_document()` factory in `storage/__init__.py` (Plan 05-02) already has lazy imports for both backends; the `# type: ignore[import]` comments can be resolved once Plan 05-05 adds the actual cloud router.
- 32 new tests in `test_cloud_backends.py` are all green.
- Full suite: 262 passed / 43 xfailed / 1 pre-existing failure (`test_extract_docx` — python-docx not installed locally).
## Self-Check: PASSED
Files verified present:
- `backend/storage/google_drive_backend.py`: FOUND
- `backend/storage/onedrive_backend.py`: FOUND
- `backend/tests/test_cloud_backends.py`: FOUND
Commits verified:
- 4efe7c1: test(05-03): add RED phase tests — FOUND
- 337ee8e: feat(05-03): implement GoogleDriveBackend — FOUND
- bcb887e: feat(05-03): implement OneDriveBackend — FOUND
---
*Phase: 05-cloud-storage-backends*
*Completed: 2026-05-28*