# DocuVault — Claude Code Guide

## Project Overview

DocuVault is a multi-user SaaS document management platform built on FastAPI (Python) + Vue 3. It handles document upload, text extraction (PDF/DOCX/image/text), AI-based topic classification, per-user isolated storage, folder organization, document sharing, and pluggable cloud storage backends (OneDrive, Google Drive, Nextcloud, WebDAV).

**Current state:** Brownfield — single-user app is functional. Active milestone: migrating to multi-user, adding auth, PostgreSQL + MinIO, and cloud storage.

## Stack

- **Backend:** Python 3.12, FastAPI 0.136+, SQLAlchemy 2.0 async, psycopg v3, Alembic, MinIO SDK
- **Frontend:** Vue 3 (Options API), Pinia, Vue Router 4, Vite, Tailwind CSS
- **Infrastructure:** Docker Compose, PostgreSQL, MinIO (S3-compatible)
- **Auth:** PyJWT 2.12+, pwdlib[argon2], pyotp (TOTP), cryptography (Fernet/HKDF)

## Key Architectural Rules

- JWT access token lives in **Pinia memory only** — never localStorage or sessionStorage
- Refresh token is an **httpOnly; Secure; SameSite=Strict cookie** — never accessible to JavaScript
- MinIO object keys are **UUID-based** (`{user_id}/{document_id}/{uuid4()}{ext}`) — human filenames in DB only
- Cloud credentials encrypted with **HKDF per-user key derivation** — master key in env var only
- Quota enforced atomically: **`UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes`**
- Admin endpoints **never return** document content, extracted text, or `credentials_enc`
- Every document/folder endpoint asserts `resource.user_id == current_user.id`
- All DB queries via ORM / parameterized statements — zero raw string interpolation

## GSD Workflow

This project uses the GSD (Get Shit Done) planning workflow. Planning artifacts live in `.planning/`.

### Key files

| File | Purpose |
|---|---|
| `.planning/ROADMAP.md` | 5-phase plan with success criteria |
| `.planning/REQUIREMENTS.md` | 54 v1 requirements with REQ-IDs |
| `.planning/STATE.md` | Current phase and completion status |
| `.planning/PROJECT.md` | Project context and key decisions |
| `.planning/research/SUMMARY.md` | Domain research synthesis |
| `.planning/codebase/` | Codebase map (architecture, stack, concerns) |

### Commands

```
/gsd:discuss-phase N — gather context before planning a phase
/gsd:plan-phase N — create execution plan for a phase
/gsd:execute-phase N — execute the plan
/gsd:verify-work N — verify phase deliverables against requirements
/gsd:progress — check status and advance workflow
```

### Current phase: Not started — run `/gsd:discuss-phase 1` to begin

## Development Setup

```bash
# Start all services
docker compose up

# Backend only (local dev)
cd backend && uvicorn main:app --reload

# Frontend only (local dev)
cd frontend && npm run dev

# Run backend tests
cd backend && pytest -v
```

## Testing Protocol (Non-Negotiable)

Every feature, function, and bug fix requires tests. No phase or plan may advance until all tests pass.

### Rules

- **Coverage**: Every new function, endpoint, and UI component must have at least one test — unit for isolated logic, integration for DB/service boundaries, E2E for critical user flows
- **Gate**: `pytest -v` (backend) and frontend test suite must pass with zero failures before marking a plan complete or advancing to the next phase
- **Bug fixes**: Must fix the root cause, not work around it. Maximum 50 lines of changed code per fix.
  If a fix requires more, it is scope creep and must be broken into a separate plan
- **No workarounds**: `# type: ignore`, `noqa`, skipping a test, or adding a `try/except` that silently swallows an error are prohibited as bug fixes
- **Regression**: Any time a bug is fixed, a test must be added that would have caught it

### Test types per layer

| Layer | Required test type |
|---|---|
| Service / business logic | Unit tests with mocked dependencies |
| DB queries / ORM | Integration tests against real PostgreSQL (not SQLite for quota/UUID tests) |
| API endpoints | `httpx.AsyncClient` integration tests with real DB fixtures |
| Auth flows | Full round-trip tests (register → login → TOTP → refresh → revoke) |
| Security invariants | Dedicated negative tests (wrong owner → 403/404, admin → 403, replay → 401) |
| Frontend | Vitest unit tests for stores/composables; Playwright or Cypress for critical flows |

---

## Security Protocol (Non-Negotiable)

A dedicated **security agent** runs after every plan execution and before any phase is marked complete. This agent has full read/write/edit access to the entire codebase and is the final gate before advancement.

### Security agent mandate

The security agent must check — and fix — every class of vulnerability listed below. It may not flag and defer; it must resolve or escalate blocking issues.

#### OWASP Top 10 + auth-specific

| Threat | Required mitigation |
|---|---|
| SQL injection | All queries via ORM or parameterized statements — zero raw string interpolation |
| XSS | CSP headers, `httpOnly` cookies, no `innerHTML` with user data, Vue template auto-escaping never bypassed |
| CSRF | `SameSite=Strict` cookie + `Origin`/`Referer` header validation on all state-changing endpoints |
| Broken auth | Short-lived JWT (≤15 min), refresh rotation, family revocation on reuse, constant-time comparison |
| IDOR / broken access control | Every resource endpoint asserts `resource.user_id == current_user.id`; admin blocked from document content |
| Security misconfiguration | No debug mode in production, no stack traces in API responses, no default credentials |
| Sensitive data exposure | Passwords hashed Argon2id, PII fields encrypted at rest, `credentials_enc` never in API responses |
| Insecure deserialization | No `pickle`, no `eval`, no dynamic `__import__`; all user-supplied data validated via Pydantic |
| Vulnerable dependencies | `pip-audit` / `npm audit` run; critical/high CVEs blocked |
| Insufficient logging | All auth events, quota violations, and admin actions written to audit log without document content |

#### Advanced threats

- **Path traversal**: All file path construction uses `os.path.basename` / `pathlib` — never joins user-supplied strings directly
- **SSRF**: All outbound HTTP (HIBP, cloud OAuth) via an allowlisted client; user-supplied URLs for WebDAV/Nextcloud must pass hostname allowlist
- **Timing attacks**: `hmac.compare_digest` / `secrets.compare_digest` for all token, TOTP, and backup-code comparison — no `==`
- **Race conditions / TOCTOU**: Quota enforcement via single atomic `UPDATE … RETURNING` — never read-then-write in Python
- **Mass assignment**: Pydantic models explicitly declare every accepted field; no `**kwargs` passthrough from request body to ORM
- **Privilege escalation**: `get_regular_user` and `get_current_admin`
  deps checked on every endpoint; no role elevation path exists
- **Token replay**: JTI stored in DB; used TOTP codes invalidated within the 90 s window; refresh token family revocation on reuse

#### Zero-day / defense-in-depth

- **Minimal attack surface**: Every endpoint that is not needed is absent — no commented-out code, no `TODO: remove` endpoints left alive
- **Principle of least privilege**: `docuvault_app` DB role has DML only; `docuvault_migrate` has DDL; MinIO bucket policy denies public access
- **Secrets in env only**: No credentials, API keys, or signing secrets in code, commits, or `.env` files checked in; `.gitignore` enforces this
- **Dependency pinning**: `requirements.txt` and `package-lock.json` pin exact versions; no floating `>=` for security-critical packages (PyJWT, pwdlib, cryptography)
- **Container hardening**: Non-root user in Dockerfile, read-only filesystem where possible, no `--privileged` containers
- **Header hardening**: `X-Content-Type-Options: nosniff`, `X-Frame-Options: DENY`, `Referrer-Policy: strict-origin-when-cross-origin` on every response

### Database user table encryption

Sensitive user PII (email, display name) must be encrypted at the application layer before storage:

- Encryption: AES-256-GCM via `cryptography` library, per-row nonce, master key from env var
- Key derivation: HKDF-SHA256 with `purpose=b"user-pii"` salt — same pattern as cloud credentials
- Admin queries: never return plaintext PII for users other than the requesting user
- Indexing: email lookup uses a deterministic HMAC-SHA256 index (`email_hmac` column) — the encrypted column is never used for WHERE clauses

### Login token hardening (state of the art)

- **Algorithm**: ES256 (ECDSA P-256) — asymmetric; the private key signs, the public key verifies; a leaked public key cannot forge tokens
- **Access token TTL**: 15 minutes maximum
- **Refresh token**: 30-day httpOnly Strict cookie; rotated on every use; reuse of a rotated token revokes entire family and
  fires a security alert email
- **JTI claim**: Every token has a unique `jti`; revoked JTIs stored in Redis with TTL matching the token lifetime
- **Token binding**: Access token embeds a `fgp` (fingerprint) claim = HMAC of `User-Agent + Accept-Language`; backend validates on every request
- **Rotation on privilege change**: Password change, TOTP enroll/revoke, and account deactivation immediately revoke all active sessions

### Security gate checklist (must all pass before phase advances)

- [ ] `bandit -r backend/` — zero HIGH severity findings
- [ ] `pip-audit` — zero critical/high CVEs
- [ ] `npm audit --audit-level=high` — zero high/critical vulnerabilities
- [ ] All security-invariant tests pass (wrong owner, admin block, token replay, CSRF)
- [ ] No new `# noqa: S` suppressions without a documented justification comment
- [ ] Admin endpoints verified to never return `password_hash`, `credentials_enc`, or document content
- [ ] No hardcoded secrets detected by `git secrets` / `trufflehog`

---

## Security Requirements (Non-Negotiable)

- Rate limiting on all auth endpoints (login, register, password reset, TOTP)
- Constant-time comparison for all token/code verification
- CSRF protection on all state-changing endpoints
- Content-Security-Policy headers on all responses
- HaveIBeenPwned API check on registration and password change
- TOTP replay prevention (mark used codes in DB within validity window)
- Refresh token family revocation on token reuse detection
- Admin impersonation is an explicit architectural exclusion — no endpoint or code path may exist
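
The `fgp` token-binding rule and the constant-time comparison requirement could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's implementation: the helper names `token_fingerprint` and `fingerprint_matches`, the newline separator between the two headers, and the hex encoding of the claim are all assumptions made for the example.

```python
import hashlib
import hmac


def token_fingerprint(secret: bytes, user_agent: str, accept_language: str) -> str:
    """Compute a hex HMAC-SHA256 over the two request headers (hypothetical helper)."""
    # Assumption: a newline separates the headers so that ("ab", "c") and
    # ("a", "bc") cannot produce the same message.
    message = f"{user_agent}\n{accept_language}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def fingerprint_matches(
    secret: bytes, claimed_fgp: str, user_agent: str, accept_language: str
) -> bool:
    """Check the `fgp` claim from a token against the current request headers."""
    expected = token_fingerprint(secret, user_agent, accept_language)
    # Compare as bytes: hmac.compare_digest rejects non-ASCII str inputs, and a
    # claim taken from a token is untrusted. The comparison avoids `==` so the
    # running time does not reveal where the two values first differ.
    return hmac.compare_digest(expected.encode("utf-8"), claimed_fgp.encode("utf-8"))
```

In a FastAPI dependency, `claimed_fgp` would come from the decoded access token and the two header values from the incoming request; a mismatch would be treated like any other invalid token.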