Files
kite/.planning/codebase/CONCERNS.md
T
2026-06-02 15:32:06 +02:00

26 KiB
Raw Blame History

Codebase Concerns

Analysis Date: 2026-06-02


Security Concerns

JWT Algorithm Downgrade: HS256 Instead of ES256

  • Risk: CLAUDE.md specifies ES256 (asymmetric ECDSA P-256) as the required algorithm, but the implementation uses HS256 (symmetric HMAC-SHA256).
  • Files: backend/services/auth.py lines 99, 109, 132, 141
  • Impact: A leaked SECRET_KEY allows arbitrary token forgery. With HS256 any party that has the secret can forge access tokens, impersonate admin users, and bypass all auth checks. ES256 would require the private key for forgery while the public key could safely be distributed for verification.
  • Fix approach: Generate an ECDSA P-256 key pair, store the private key in an env var (JWT_PRIVATE_KEY), store the public key as JWT_PUBLIC_KEY. Update create_access_token to use algorithm="ES256" and decode_access_token / decode_password_reset_token to use the public key. Rotate all active refresh tokens after deploy.
  • Priority: HIGH

No JTI Claim and No JTI Revocation in Redis

  • Risk: CLAUDE.md mandates JTI (JWT ID) in every access token stored in Redis for revocation, but the create_access_token function emits no jti claim and there is no check in get_current_user.
  • Files: backend/services/auth.py (create_access_token), backend/deps/auth.py (get_current_user)
  • Impact: Deactivated users can continue using valid access tokens until TTL expiry (up to 15 minutes). Password changes and account deactivations do not immediately invalidate active sessions (only refresh tokens are revoked — not the live access token in the client's Pinia store).
  • Fix approach: Add jti=str(uuid.uuid4()) to the access token payload. In get_current_user, after successful decode, check await redis.get(f"jti_revoked:{jti}") and raise 401 if set. Add a revoke_access_token(jti, ttl) helper called from account deactivation and password change.
  • Priority: HIGH

No Token Fingerprint / Token Binding

  • Risk: CLAUDE.md requires a fgp (fingerprint) claim = HMAC of User-Agent + Accept-Language, validated on every request. This is absent.
  • Files: backend/services/auth.py, backend/deps/auth.py
  • Impact: Stolen access tokens can be replayed from any device/browser. Token binding would limit the window of a stolen token attack.
  • Fix approach: On login, compute fgp = hmac.new(key, (user_agent + accept_lang).encode(), sha256).hexdigest()[:16]. Embed in JWT payload. In get_current_user, recompute and compare with hmac.compare_digest.
  • Priority: MEDIUM

Password Change Does Not Revoke Active Sessions

  • Risk: POST /api/auth/change-password updates password_hash and writes an audit log but never calls revoke_all_refresh_tokens. CLAUDE.md mandates "Password change… immediately revoke all active sessions."
  • Files: backend/api/auth.py lines 446495
  • Impact: An attacker who has a valid refresh cookie can continue rotating tokens even after the account owner changes their password.
  • Fix approach: Add await auth_service.revoke_all_refresh_tokens(session, current_user.id) after the password hash update, before session.commit(), and also invalidate all JTIs for that user in Redis (once JTI is implemented).
  • Priority: HIGH

TOTP Disable Does Not Revoke Active Sessions

  • Risk: DELETE /api/auth/totp clears the TOTP secret and disables TOTP but does not call revoke_all_refresh_tokens. CLAUDE.md mandates revocation on "TOTP enroll/revoke."
  • Files: backend/api/auth.py lines 587616
  • Impact: An attacker who triggered TOTP removal (via CSRF or compromised session) and has a refresh token continues to operate as an authenticated user with no second factor.
  • Fix approach: Add await auth_service.revoke_all_refresh_tokens(session, current_user.id) in disable_totp before session.commit().
  • Priority: HIGH

Health Endpoint Exposes Internal Error Details Without Auth

  • Risk: GET /health returns full Python exception class names and messages (e.g. "error: OperationalError: (psycopg.OperationalError) …") with no authentication requirement. The comment at line 144 (T-01-05-03) acknowledges this but defers the fix to "Phase 2."
  • Files: backend/main.py lines 136167
  • Impact: Exposes DB driver versions, hostnames, and connection string fragments to unauthenticated callers. Information useful for targeted attacks.
  • Fix approach: Replace f"error: {type(e).__name__}: {e}" with "error" in non-debug mode. Log the detail server-side only. Optionally require admin Bearer token for the detailed form.
  • Priority: MEDIUM

Default Secrets Shipped in Code

  • Risk: backend/config.py hardcodes secret_key = "CHANGEME", cloud_creds_key = "CHANGEME-32-bytes-padded!!", and minio_secret_key = "changeme_minio_app" as Pydantic field defaults.
  • Files: backend/config.py lines 31, 61, 21
  • Impact: If deployed without overriding env vars, production tokens are signed with the known CHANGEME key, all cloud credentials can be decrypted by anyone with the source code, and MinIO uses a known password. Critical misconfiguration vector.
  • Fix approach: Change defaults to "" and add a startup validator (@model_validator(mode="after")) that raises ValueError when these fields equal their placeholder values in production (DEBUG=false). Log a WARNING in dev if the default is detected.
  • Priority: HIGH

default_storage_backend Not Validated Against Allowlist

  • Risk: PATCH /api/users/me/default-storage accepts body.backend as a free string and writes it directly to the DB with no allowlist validation.
  • Files: backend/api/cloud.py lines 927946
  • Impact: A user can set default_storage_backend to any arbitrary string. A future code path using it as a routing key could allow bypassing the _CLOUD_PROVIDERS allowlist.
  • Fix approach: Validate body.backend in {"minio", "google_drive", "onedrive", "nextcloud", "webdav"} before the DB write. Use a Literal type or @field_validator on DefaultStorageRequest.
  • Priority: MEDIUM

X-Forwarded-For Trusted for IP Rate Limiting Without Proxy Enforcement

  • Risk: The IP-level rate limiter (slowapi) uses get_remote_address which reads X-Forwarded-For. Without a trusted reverse proxy normalizing this header, an attacker can bypass the IP rate limit.
  • Files: backend/api/auth.py line 44; backend/deps/utils.py
  • Impact: Attackers can bypass the 10 req/min IP-level limit on login, register, and TOTP endpoints by spoofing the forwarded IP on each request.
  • Fix approach: In Docker Compose, front the backend with nginx configured to set X-Forwarded-For from $remote_addr, stripping any client-supplied value. Document this as a mandatory production requirement.
  • Priority: HIGH

  • Risk: send_password_reset_email builds an HTML body via f-string with reset_link directly in an <a href='…'> attribute without HTML-escaping.
  • Files: backend/services/email.py line 47; line ~105 (security alert email)
  • Impact: If reset_link contains a single-quote (possible under certain URL encoding), the HTML attribute breaks. Low-severity HTML injection risk that violates defense-in-depth.
  • Fix approach: Use html.escape(reset_link, quote=True) when embedding the link in the HTML body.
  • Priority: LOW

Audit Log Written After Commit in delete_folder

  • Risk: In DELETE /api/folders/{folder_id}, session.commit() is called at line 424 and write_audit_log() is called at line 426 — after the commit, in a separate implicit transaction.
  • Files: backend/api/folders.py lines 424435
  • Impact: If the audit log write fails (DB error, constraint violation), the folder is already deleted with no audit record. Inconsistent with the WR-08 pattern used by delete_document (auto_commit=False).
  • Fix approach: Move write_audit_log() before session.commit(), following the pattern used in api/documents.py::delete_document.
  • Priority: MEDIUM

OAuth Callback Error Redirect May Leak Internal Exception Details

  • Risk: In oauth_callback, the except Exception as exc block redirects to frontend_url/settings?cloud_error={urllib.parse.quote(str(exc))}. Exception strings from google-auth-oauthlib or msal may include OAuth client secrets, state values, or internal URL fragments.
  • Files: backend/api/cloud.py lines 541546
  • Impact: Exception details appear in the browser URL bar, referrer headers, browser history, and server access logs.
  • Fix approach: Map exception types to user-safe generic messages ("auth_failed", "connection_error"). Log the real exception server-side at ERROR level. Only pass an opaque error code in the redirect.
  • Priority: MEDIUM

Performance Concerns

N+1 Query Pattern in list_metadata / list_documents

  • Risk: services/storage.py::list_metadata loads all documents then calls _load_topic_names(session, doc.id) in a Python loop — one DB round-trip per document. The same pattern repeats in the list_documents handler's non-legacy code path.
  • Files: backend/services/storage.py lines 136139; backend/api/documents.py lines 501506
  • Impact: For a user with 100 documents, a single list request issues 101 DB queries. At 1000 documents, 1001 queries. Response time degrades linearly as the library grows.
  • Fix approach: Replace with a single JOIN query using PostgreSQL's array_agg(t.name) grouped by document. Or use a subquery fetching all document-topic associations for the user in one query and merging in Python.
  • Priority: HIGH

Entire File Loaded into Memory for Download and Task Processing

  • Risk: GET /api/documents/{id}/content calls await storage_backend.get_object(…) which returns full bytes, loads them into a list, and returns StreamingResponse(iter([file_bytes])). The Celery extraction task also buffers the full file.
  • Files: backend/api/documents.py lines 792, 827831; backend/tasks/document_tasks.py line 74
  • Impact: A 100 MB file consumes 100 MB of heap per concurrent request. With 10 simultaneous downloads, the worker needs 1 GB just for file buffers. The 100 MB quota mitigates this today but does not scale.
  • Fix approach: For MinIO, return presigned GET URLs with short TTL instead of proxying through FastAPI. For cloud backends, pipe the provider HTTP response stream directly. For Celery extraction, stream text extraction from bytes in chunks.
  • Priority: MEDIUM

No Upload Size Pre-Validation

  • Risk: POST /api/documents/upload (cloud path) reads the entire file via await file.read() before any quota or size check. FastAPI has no global max_upload_size configured.
  • Files: backend/api/documents.py line 207
  • Impact: A malicious user can upload a multi-gigabyte file, exhausting FastAPI worker memory before the quota check fires.
  • Fix approach: Check Content-Length header at endpoint entry; reject with 413 if above a configurable MAX_UPLOAD_BYTES limit. Add a --limit-max-requests or body-size middleware at the uvicorn/nginx level.
  • Priority: MEDIUM

revoke_all_refresh_tokens Issues One UPDATE per Token

  • Risk: services/auth.py::revoke_all_refresh_tokens loads all active refresh token rows into Python, then marks each revoked=True individually via ORM, issuing one UPDATE statement per token.
  • Files: backend/services/auth.py lines 218237
  • Impact: A user with many active sessions (e.g. 50 devices) causes 50 individual UPDATE statements on sign-out-all. Could be replaced with a single bulk UPDATE.
  • Fix approach: Replace with UPDATE refresh_tokens SET revoked = true WHERE user_id = :uid AND revoked = false and count affected rows via result.rowcount.
  • Priority: LOW

FTS Falls Back Silently on Any Exception

  • Risk: The FTS code path in list_documents wraps the FTS query in except Exception: and falls back to an unfiltered query.
  • Files: backend/api/documents.py lines 486489
  • Impact: Any PostgreSQL error causes silent fallback — the user sees all their documents when they searched for a term, with no indication of failure.
  • Fix approach: Narrow the catch to sqlalchemy.exc.OperationalError (for SQLite compat in tests only) and log all other exceptions at ERROR level before re-raising.
  • Priority: MEDIUM

Reliability Concerns

Email Queue Worker Missing From Docker Compose

  • Risk: Celery routes email tasks to the email queue but docker-compose.yml defines only one Celery worker consuming -Q documents. No worker processes the email queue.
  • Files: backend/celery_app.py line 36; docker-compose.yml line 96
  • Impact: Password reset emails, security alert emails (refresh token reuse detection), and backup code emails are silently enqueued but never delivered. Callers receive 202 but emails never arrive.
  • Fix approach: Add a celery-worker-email service in docker-compose.yml consuming -Q email, or update the existing worker command to -Q documents,email.
  • Priority: HIGH

documents.updated_at Not Auto-Updated on Row Changes

  • Risk: Document.updated_at is declared with server_default=func.now() but no onupdate trigger. When extracted_text, status, or filename is changed, updated_at stays as the creation timestamp.
  • Files: backend/db/models.py lines 192194
  • Impact: classified_at in _doc_to_dict is computed from doc.updated_at when status == "classified" — if updated_at is stale, the displayed timestamp is incorrect. Sort-by-date after reclassification is also wrong.
  • Fix approach: Add a PostgreSQL BEFORE UPDATE trigger that sets updated_at = now(), or add onupdate=func.now() to the mapped column (requires SQLAlchemy ORM event to fire at update time).
  • Priority: MEDIUM

Celery Task Result Backend Accumulates Without Expiry

  • Risk: Celery is configured to use Redis as result backend with no result_expires setting. Task results accumulate in Redis indefinitely.
  • Files: backend/celery_app.py lines 2324
  • Impact: Redis memory grows unboundedly over time, potentially causing OOM which would also break rate limiting and TOTP replay prevention.
  • Fix approach: Add celery_app.conf.result_expires = 3600 or disable the result backend entirely since no code reads task results.
  • Priority: MEDIUM

Breadcrumb Builder in get_folder Has No Depth Limit

  • Risk: The breadcrumb builder in GET /api/folders/{folder_id} walks up the parent chain iteratively with a visited set but no maximum depth cap.
  • Files: backend/api/folders.py lines 234247
  • Impact: With a deeply nested folder tree (e.g. 200 levels of nesting), the loop issues 200 sequential DB round-trips before terminating.
  • Fix approach: Add if len(crumbs) >= 20: break to cap at a reasonable depth.
  • Priority: LOW

Code Quality Concerns

Duplicate Inline IP Extraction (Not Using get_client_ip)

  • Risk: Several endpoints extract the client IP inline instead of using deps/utils.py::get_client_ip().
  • Files: backend/api/documents.py lines 269271, 376; backend/api/cloud.py lines 624, 753
  • Impact: If the trusted-proxy logic changes, all inline copies must be updated individually.
  • Fix approach: Replace all inline request.headers.get("X-Forwarded-For") or request.client.host with get_client_ip(request).
  • Priority: LOW

Document.status Is an Unconstrained String Column

  • Risk: Document.status is String, nullable=False, default="pending" with no DB-level CHECK constraint or Python enum. Values "pending", "uploaded", "classified", "classification_failed" are used in code but not enforced.
  • Files: backend/db/models.py line 188
  • Impact: A typo in a task or direct DB write silently sets an invalid status, causing silent bugs in status-checking code (e.g. classified_at timestamp never shown).
  • Fix approach: Add a migration with ALTER TABLE documents ADD CONSTRAINT ck_documents_status CHECK (status IN ('pending', 'uploaded', 'classified', 'classification_failed')).
  • Priority: LOW

Stale Wave-2 Comment in documents.py Module Docstring

  • Risk: backend/api/documents.py lines 1920 contain "NOTE (Wave 2): No auth guards on any endpoint yet — Plan 03-03 adds get_current_user…" — this is false; all handlers use get_regular_user.
  • Files: backend/api/documents.py lines 1920
  • Impact: Misleads reviewers into thinking auth is not applied, potentially causing incorrect security assessments.
  • Fix approach: Remove or replace the stale NOTE comment.
  • Priority: LOW

classify_document Endpoint Uses Mutable Default and Unvalidated Dict Body

  • Risk: POST /api/documents/{doc_id}/classify has body: dict = {} — mutable default argument antipattern and no Pydantic validation.
  • Files: backend/api/documents.py line 695
  • Impact: Static analysis confusion; unvalidated request body accepts arbitrary JSON keys.
  • Fix approach: Define class ClassifyRequest(BaseModel): topics: Optional[list[str]] = None and replace body: dict = {}.
  • Priority: LOW

Missing Tests / Coverage Gaps

No Tests for JWT Algorithm, JTI, or Token Binding

  • Risk: No tests verify the JWT algorithm, JTI presence/validation, or token binding.
  • Files: backend/tests/test_auth_deps.py, backend/tests/test_auth_api.py
  • Impact: Algorithm or claim changes would not be caught. The security invariants are untested.
  • Priority: HIGH

Password Change Has No Session-Revocation Test

  • Risk: No test verifies that changing a password invalidates existing refresh tokens.
  • Files: backend/tests/test_auth_api.py
  • Fix approach: Add test: register → login (obtain refresh cookie) → change password → assert old refresh cookie returns 401.
  • Priority: HIGH

No Frontend E2E Tests

  • Risk: The frontend has Vitest unit tests for 3 stores and a small set of components, but no Playwright or Cypress E2E tests for critical user flows.
  • Files: frontend/src/stores/__tests__/, frontend/src/views/__tests__/
  • Impact: Breaking changes in API contract, router guards, or component interactions are not caught until manual testing. The upload flow, TOTP enrollment, and admin operations have no automated coverage.
  • Fix approach: Add Playwright E2E tests for: login → upload → view → share → recipient download.
  • Priority: MEDIUM

No Regression Test for delete_folder Audit Log Ordering

  • Risk: The audit log after-commit ordering issue in delete_folder has no test to prevent regression after fixing.
  • Files: backend/tests/test_folders.py
  • Priority: MEDIUM

Quota Concurrency Tests Run Against SQLite, Not PostgreSQL

  • Risk: Quota enforcement tests run against SQLite in the default test config. CLAUDE.md specifies "integration tests against real PostgreSQL (not SQLite for quota/UUID tests)."
  • Files: backend/tests/test_quota.py; backend/tests/conftest.py
  • Impact: A race condition in the atomic quota UPDATE would only be detectable with concurrent clients on real PostgreSQL.
  • Fix approach: Mark quota atomicity tests with @pytest.mark.skipif(not live_services_available, ...) and add a concurrent-upload test using asyncio.gather.
  • Priority: MEDIUM

Pinia Stores documents.js and topics.js Have No Unit Tests

  • Risk: The stores for documents and topics — which implement pagination, filtering, and topic assignment logic — have no tests in frontend/src/stores/__tests__/.
  • Files: frontend/src/stores/documents.js, frontend/src/stores/topics.js
  • Priority: LOW

Dependency Risks

All Backend Dependencies Use Floor >= Version Pins

  • Risk: backend/requirements.txt uses >= for all packages including security-critical ones: PyJWT>=2.8.0, pwdlib[argon2]>=0.2.1, cryptography>=41.0.0, fastapi>=0.111.
  • Files: backend/requirements.txt
  • Impact: pip install resolves to the latest available version at build time. A breaking change or vulnerability in any dependency silently takes effect on the next Docker build. CLAUDE.md mandates exact version pinning for security-critical packages.
  • Fix approach: Run pip freeze > requirements.lock to generate an exact pinned lockfile. Use pip-tools or uv lock to manage upgrades. At minimum, pin PyJWT, pwdlib, cryptography, and fastapi to exact versions.
  • Priority: HIGH

minio/minio:latest Tag in Docker Compose

  • Risk: docker-compose.yml uses image: minio/minio:latest — a floating tag that pulls a new release on docker compose pull.
  • Files: docker-compose.yml line 19
  • Impact: Breaking MinIO API changes or security regressions in a new release could break file storage without warning.
  • Fix approach: Pin to a specific MinIO release tag (e.g. minio/minio:RELEASE.2024-11-07T00-52-20Z).
  • Priority: MEDIUM

Infrastructure and Operational Concerns

No Reverse Proxy / TLS Termination in Production Setup

  • Risk: docker-compose.yml exposes the FastAPI backend on port 8000 and frontend on port 5173 directly, with no nginx or Caddy container for TLS termination or X-Forwarded-For normalization.
  • Files: docker-compose.yml
  • Impact: (1) The refresh cookie uses secure=True in code but travels over plain HTTP, making the secure flag ineffective. (2) IP rate limiting is spoofable. (3) Credentials and session cookies travel in cleartext.
  • Fix approach: Add an nginx service to docker-compose.yml that terminates TLS (Let's Encrypt or self-signed), proxies /api/ to the backend, and sets proxy_set_header X-Forwarded-For $remote_addr.
  • Priority: HIGH

MinIO Uses Plain HTTP Between Containers

  • Risk: Minio(…, secure=False) — all object data travels over HTTP between FastAPI and MinIO containers.
  • Files: backend/main.py line 82; backend/storage/__init__.py line 48
  • Impact: An attacker with access to the Docker network can intercept document bytes in transit. Critical if containers share a host with untrusted workloads.
  • Fix approach: Enable TLS on MinIO (secure=True) or document the trust model explicitly. For shared-host deployments, configure mTLS between containers.
  • Priority: MEDIUM (acceptable on isolated Docker bridge; critical on shared host)

No Backup Strategy for PostgreSQL or MinIO Data

  • Risk: docker-compose.yml uses named volumes (postgres_data, minio_data) with no backup tooling, retention policy, or point-in-time recovery.
  • Files: docker-compose.yml lines 138140
  • Impact: A disk failure, container wipe, or accidental docker volume rm causes permanent loss of all user documents, credentials, audit logs, and accounts.
  • Fix approach: Add a backup service running pg_dump on a schedule (e.g. via ofelia or a cron sidecar), compressing and shipping to an off-site store. Configure MinIO mc mirror to a second bucket or provider. Document RTO/RPO targets.
  • Priority: HIGH

Redis Has No Persistence Configuration

  • Risk: Redis is started with only --requirepass. No --save or --appendonly yes flags are set, making all Redis data ephemeral.
  • Files: docker-compose.yml line 42
  • Impact: A Redis restart clears all rate-limit counters (brief brute-force window on auth endpoints), TOTP replay prevention keys (30-second replay window reopens), and pending OAuth state tokens.
  • Fix approach: Add --save 60 1 --appendonly yes to the Redis command and mount a Redis data volume. Document that Redis restart is a brief security event requiring monitoring.
  • Priority: MEDIUM

Docker Compose Mounts Source Code as Live Volume

  • Risk: docker-compose.yml mounts ./backend:/app and ./frontend/src:/app/src as live volumes (appropriate for dev hot-reload but dangerous in production if the same file is used).
  • Files: docker-compose.yml lines 5354, 131132
  • Impact: In production, host filesystem modifications immediately affect the running container without a deploy cycle.
  • Fix approach: Create a docker-compose.prod.yml that omits the volume mounts and uses the Dockerfile COPY . . layer only. Document the two-file strategy clearly.
  • Priority: MEDIUM

Dockerfile Runs Application as Root

  • Risk: backend/Dockerfile uses FROM python:3.12-slim with no USER directive. FastAPI and Celery run as root inside the container.
  • Files: backend/Dockerfile
  • Impact: A container escape vulnerability or SSRF leading to RCE gives the attacker root-equivalent access to the container filesystem.
  • Fix approach: Add RUN adduser --disabled-password --gecos "" appuser && chown -R appuser /app and USER appuser before EXPOSE 8000.
  • Priority: MEDIUM

No Structured Logging, Metrics, or Alerting

  • Risk: All logging uses Python's stdlib logger with no structured format, no Prometheus/StatsD metrics endpoint, no error aggregation service, and no alerting on security events in the audit log.
  • Files: All backend files
  • Impact: Silent failures — email queue not processing, repeated TOTP replay attempts, brute-force login spikes — go undetected. Failed Celery tasks log to stderr with no aggregation. The security alert email on refresh token reuse is the only active notification mechanism.
  • Fix approach: Add structlog for JSON-formatted structured logs. Add a /metrics endpoint with prometheus-fastapi-instrumentator. Configure alerting on auth.login_failed count spikes in the audit log.
  • Priority: MEDIUM

Concerns audit: 2026-06-02