Business-Management/features/doc-service/STATUS.md
curo1305 94901fc30f Redesign doc service UX for scale + add group-based document sharing
- Three-column layout: Sidebar + SourcePanel (views + searchable category tree) + main
- DocumentSlideOver (480px right panel): inline editing, type picker, AI suggestion confirm/reject,
  categories combobox, tags editor, sharing section, raw text, re-analyse/delete actions
- ManageCategoriesDialog: inline rename, delete with confirm, search filter
- DocumentsPage rewrite: filter chip system, multi-file upload queue, drag-and-drop overlay,
  bulk actions bar (share/delete), smart TanStack Query polling, URL-driven view state
- Sidebar simplified: per-category NavLinks removed; Documents = single NavLink under Apps
- Backend: document_shares table (migration 0004), share CRUD endpoints, shared-with-me view,
  N+1-safe share_count via GROUP BY, recipient download access, X-User-Groups header enforcement
- Gateway proxy: injects X-User-Groups header into all document + category proxy requests
- Backend users: GET /api/users/me/groups endpoint for share picker combobox
- CLAUDE.md, STATUS.md files, and changelog updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 12:46:43 +02:00


Doc Service — Status

What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis. Also supports automatic ingestion from a mounted watch directory (NAS, Nextcloud, Syncthing, etc.).

Port: 8001 (internal only, not exposed to host). All traffic arrives via the backend proxy (backend/app/routers/documents_proxy.py), which injects the authenticated x-user-id header.

Database: shared PostgreSQL instance, isolated via alembic_version_doc_service Alembic version table. Storage: /data/documents/ (Docker named volume doc_data). Watch directory: /data/watch (named volume watch_data in prod; bind-mount in dev via docker-compose.dev.yml).


Current functionality

Document lifecycle

  1. POST /documents/upload — validate PDF, persist file to /data/documents/{user_id}/{doc_id}.pdf, create DB row with status=pending, enqueue background extraction
  2. Background task: extract text with pdfplumber → POST to ai-service /chat → parse JSON result → update status=done (or failed)
  3. AI extracts: title, document_type, tags, suggested_categories, plus domain fields (vendor, customer, dates, amounts, etc.) into extracted_data (JSON string)
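The lifecycle above can be sketched as a small state machine. This is a minimal illustration, not the service's actual code: the `Doc` dataclass and the injected `extract_text` and `classify` callables are hypothetical stand-ins for the DB row, pdfplumber, and the ai-service call.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Doc:
    """Hypothetical in-memory stand-in for the documents DB row."""
    id: str
    status: str = "pending"
    extracted_data: Optional[str] = None
    error_message: Optional[str] = None


def process_document(
    doc: Doc,
    extract_text: Callable[[], str],   # assumed wrapper around pdfplumber
    classify: Callable[[str], str],    # assumed wrapper around ai-service /chat
) -> Doc:
    """Move a document from pending to done, or to failed on any error."""
    doc.status = "processing"
    try:
        text = extract_text()
        raw = classify(text)
        json.loads(raw)                # reject responses that are not valid JSON
        doc.extracted_data = raw       # stored as a JSON string
        doc.status = "done"
    except Exception as exc:           # any failure ends in status=failed
        doc.status = "failed"
        doc.error_message = str(exc)
    return doc
```

Injecting the two callables keeps the sketch runnable without a PDF or a network call; in the service these steps run inside the background task.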

Endpoints

Method  Path                                          Description
POST    /documents/upload                             Upload PDF; returns 202 with initial doc row
GET     /documents                                    Paginated list with filters, sort, and optional category_id filter
GET     /documents/{id}                               Single document
GET     /documents/{id}/status                        Lightweight status poll
GET     /documents/{id}/download                      Stream file bytes
DELETE  /documents/{id}                               Delete document and file
PATCH   /documents/{id}/type                          Update document type
PATCH   /documents/{id}/tags                          Replace tag list (dedup, preserve order)
PATCH   /documents/{id}/title                         Update editable title
GET     /documents/categories                         List all categories (user + watch)
POST    /documents/categories                         Create a category; triggers re-analysis of documents in similar categories
POST    /documents/{id}/reprocess                     Reset status to pending and re-run AI extraction; 409 if already pending/processing
PATCH   /documents/categories/{id}                    Rename a category
DELETE  /documents/categories/{id}                    Delete a category
POST    /documents/{id}/categories/{cat_id}           Assign category to document
DELETE  /documents/{id}/categories/{cat_id}           Remove category from document
POST    /documents/{id}/suggestions/folder/confirm    Apply AI folder suggestion → create/find category + assign
POST    /documents/{id}/suggestions/folder/reject     Clear AI folder suggestion
POST    /documents/{id}/suggestions/filename/confirm  Apply AI filename suggestion → set title
POST    /documents/{id}/suggestions/filename/reject   Clear AI filename suggestion
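
The "dedup, preserve order" behaviour of PATCH /documents/{id}/tags could look roughly like the sketch below. The function name `dedupe_tags` is hypothetical, and whether the service also trims whitespace or normalises case is not stated here, so this version compares tags exactly as given.

```python
def dedupe_tags(tags: list[str]) -> list[str]:
    """Drop repeated tags while keeping the first occurrence of each, in order."""
    # dict preserves insertion order, so fromkeys() gives an ordered de-duplication.
    return list(dict.fromkeys(tags))
```

For example, `dedupe_tags(["tax", "2024", "tax"])` keeps `"tax"` in its first position and drops the repeat.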

Plugin endpoints (internal — backend calls only)

Method  Path              Description
GET     /plugin/manifest  Static manifest: metadata, JSON Schema for settings, access rules
GET     /plugin/settings  Current watch/storage config values
PATCH   /plugin/settings  Update watch/storage config (persisted to /config/doc_service_config.json)

Pagination & filtering (GET /documents)

Query params:

Param          Default     Notes
page           1           ≥ 1
per_page       20          1–100
sort           created_at  created_at, processed_at, filename, title, file_size, status, document_type
order          desc        asc | desc
status         (none)      filter by status string
document_type  (none)      filter by document type
search         (none)      case-insensitive ILIKE on title, filename, tags, document_type
category_id    (none)      filter to documents assigned to this category UUID

Response: { items: [...], total: N, page: N, pages: N }
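
As a rough illustration of how the response envelope relates to the query params, the sketch below derives the row offset and the `pages` value. The helper names are hypothetical, and returning `pages = 0` for an empty result set is an assumption; the service may report 1 instead.

```python
import math
from typing import Any


def offset_for(page: int, per_page: int) -> int:
    """Row offset for a 1-based page number."""
    return (page - 1) * per_page


def page_envelope(items: list[Any], total: int, page: int, per_page: int) -> dict:
    """Build the { items, total, page, pages } envelope."""
    # Assumption: an empty result set reports zero pages.
    pages = math.ceil(total / per_page) if total else 0
    return {"items": items, "total": total, "page": page, "pages": pages}
```

With 45 matching documents and the default `per_page` of 20, this yields three pages, and page 3 starts at offset 40.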

Document schema

id                UUID
user_id           string (from x-user-id header; "watch" for watch-ingested docs)
filename          original filename
title             AI-suggested editable title (nullable)
file_size         bytes
status            pending | processing | done | failed
document_type     AI-classified type (nullable)
extracted_data    JSON string — all AI-extracted fields
tags              JSON array string — editable tags
error_message     set if status=failed
created_at        upload timestamp
processed_at      when extraction finished
source            "upload" (default) or "watch"
watch_path        original absolute path in watch directory (nullable)
suggested_folder  AI-suggested category name, pending user confirm (nullable)
suggested_filename AI-suggested title/rename, pending user confirm (nullable)
categories        many-to-many via category_assignments

Watch-ingested documents (user_id = "watch") are visible to all authenticated users.

AI extraction (via ai-service)

System prompt and user prompt template are loaded at runtime from doc_service_config.json (system_prompts key). Defaults are built into the service and used as fallback if the config key is absent. Changes made via the AI Settings UI take effect within 30 seconds (config cache TTL).
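
The 30-second cache with built-in fallback might be sketched as follows. The class name `PromptConfig`, the placeholder default text, and the injectable `clock` are hypothetical; falling back when the file is missing or unreadable is also an assumption, since only the absent-key case is described above.

```python
import json
import time
from pathlib import Path
from typing import Callable

# Placeholder defaults; the real built-in prompts are not reproduced here.
DEFAULT_PROMPTS = {"system": "<built-in system prompt>", "user": "<built-in user template>"}


class PromptConfig:
    """Reads the system_prompts key and re-reads the file once the TTL expires."""

    def __init__(self, path: Path, ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._path = path
        self._ttl = ttl
        self._clock = clock
        self._cached = None
        self._loaded_at = 0.0

    def _load(self) -> dict:
        try:
            data = json.loads(self._path.read_text())
        except (OSError, ValueError):
            return dict(DEFAULT_PROMPTS)      # assumption: unreadable file also falls back
        prompts = data.get("system_prompts") if isinstance(data, dict) else None
        return prompts or dict(DEFAULT_PROMPTS)

    def get(self) -> dict:
        now = self._clock()
        if self._cached is None or now - self._loaded_at >= self._ttl:
            self._cached = self._load()
            self._loaded_at = now
        return self._cached
```

Passing the clock in makes the expiry behaviour testable without sleeping for 30 seconds.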

Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

  • title — suggested human-readable title
  • document_type — invoice / bill / receipt / order / expense / revenue / unknown
  • tags — list of keyword tags
  • suggested_categories — list of category names to suggest in the UI
  • Domain fields: vendor, customer, invoice_number, due_date, total_amount, currency, etc.
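
A minimal sketch of the truncation and response handling described above. The function names are hypothetical, and mapping an unrecognised `document_type` to `"unknown"` is an assumption rather than confirmed service behaviour.

```python
import json

MAX_PROMPT_CHARS = 50_000
KNOWN_TYPES = {"invoice", "bill", "receipt", "order", "expense", "revenue", "unknown"}


def build_prompt_text(extracted_text: str) -> str:
    """Only the first 50 000 characters of extracted text are sent to the model."""
    return extracted_text[:MAX_PROMPT_CHARS]


def parse_ai_result(raw: str) -> dict:
    """Split the model's JSON reply into top-level columns plus the raw payload."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object from the AI service")
    doc_type = data.get("document_type")
    if doc_type not in KNOWN_TYPES:
        doc_type = "unknown"                  # assumption: unrecognised types collapse to unknown
    return {
        "title": data.get("title"),
        "document_type": doc_type,
        "tags": list(data.get("tags") or []),
        "suggested_categories": list(data.get("suggested_categories") or []),
        "extracted_data": json.dumps(data),   # all fields, kept as a JSON string
    }
```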

Config (runtime, persisted to shared volume)

/config/doc_service_config.json:

{ "documents": { "max_pdf_bytes": 20971520 } }

Env override: DOC_MAX_PDF_MB
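
One plausible way to resolve the effective upload limit is sketched below. It assumes the env var takes precedence over the config file and that one MB means 1024 × 1024 bytes (consistent with 20971520 = 20 MB); the function name and the built-in default are hypothetical.

```python
from typing import Mapping

DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520, matching the example config


def max_pdf_bytes(config: dict, env: Mapping[str, str]) -> int:
    """Return the upload size limit in bytes."""
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        # Assumption: the env override wins and is expressed in whole megabytes.
        return int(override) * 1024 * 1024
    return int(config.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
```

In the service, `env` would typically be `os.environ`; passing it in keeps the function easy to test.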

Watch directory feature

Controlled via plugin settings (UI accessible to superusers and doc-service-admin group members):

  • watch_enabled — toggle file watching (default: false)
  • watch_path — mount point (read-only, /data/watch; override via Docker volume)
  • ai_folder_suggestion — AI suggests a category for each ingested doc (user confirms)
  • ai_folder_default — default category when AI suggestion is disabled
  • ai_rename_suggestion — AI suggests a title for each ingested doc (user confirms)

On startup, the watcher scans the watch directory and ingests any PDFs not already in the database (idempotency check by watch_path). Subfolders are automatically mapped to categories (e.g. watch/invoices/bill.pdf → category "invoices"). No-remove policy: deleting a file from the watch directory does not delete the document record.
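
The folder-to-category mapping and the idempotency check can be illustrated with `pathlib`. Both helper names are hypothetical. Using only the first-level subfolder as the category is an assumption (nested folders may be handled differently), and `rglob("*.pdf")` is case-sensitive on Linux, so files ending in `.PDF` would need extra handling.

```python
from pathlib import Path
from typing import Optional


def category_for(watch_root: Path, file_path: Path) -> Optional[str]:
    """Map watch/invoices/bill.pdf to "invoices"; files in the root get no category."""
    rel = file_path.relative_to(watch_root)
    # Assumption: only the first-level folder names the category.
    return rel.parts[0] if len(rel.parts) > 1 else None


def files_to_ingest(watch_root: Path, known_watch_paths: set[str]) -> list[Path]:
    """Startup scan: every PDF under the root whose path is not already recorded."""
    return sorted(
        p for p in watch_root.rglob("*.pdf")
        if str(p) not in known_watch_paths
    )
```

Here `known_watch_paths` stands in for the set of `watch_path` values already stored in the database.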

Document sharing (document_shares)

Group-based sharing allows a document owner to share a document with all members of any group they belong to. Recipients can view and download the shared document; they cannot edit, re-analyse, delete, or re-share it.

The gateway injects X-User-Groups: <group_id1>,<group_id2>,... alongside the existing X-User-Id header, so doc-service can evaluate group access without querying the backend DB.
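
Parsing the comma-separated header might look like the sketch below. The function names are hypothetical, and tolerating stray whitespace or empty segments is a defensive assumption rather than documented behaviour.

```python
from typing import Optional


def parse_groups(header_value: Optional[str]) -> frozenset[str]:
    """Turn "g1,g2,..." from X-User-Groups into a set of group IDs."""
    if not header_value:
        return frozenset()
    return frozenset(part.strip() for part in header_value.split(",") if part.strip())


def may_share_with(group_id: str, header_value: Optional[str]) -> bool:
    """An owner may only share with a group that appears in their own header."""
    return group_id in parse_groups(header_value)
```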

Method  Path                               Auth        Description
GET     /documents/shared-with-me          Any user    Documents shared with the user via their groups; excludes own docs
GET     /documents/{id}/shares             Owner only  List all groups the document is shared with
POST    /documents/{id}/shares             Owner only  Share with a group ({group_id} in body); group must be in X-User-Groups
DELETE  /documents/{id}/shares/{group_id}  Owner only  Stop sharing with that group

DocumentOut now includes share_count: int — the number of groups the document is shared with.
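
Computing `share_count` for a whole page of documents in one query (rather than one query per document) could be sketched as below. It uses an in-memory SQLite connection purely so the example runs standalone; the service itself uses PostgreSQL, and the function name is hypothetical.

```python
import sqlite3


def share_counts(conn: sqlite3.Connection, doc_ids: list[str]) -> dict[str, int]:
    """Return {document_id: number of groups shared with} using a single GROUP BY."""
    if not doc_ids:
        return {}
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        "SELECT document_id, COUNT(*) FROM document_shares "
        f"WHERE document_id IN ({placeholders}) GROUP BY document_id",
        doc_ids,
    ).fetchall()
    counts = dict.fromkeys(doc_ids, 0)   # documents with no shares still report 0
    counts.update(rows)                  # rows are (document_id, count) pairs
    return counts
```

The list endpoint can then attach `counts[doc.id]` to each item without issuing further queries.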

GET /documents/{id}/download also allows access to shared documents (recipients can download).
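
Taken together, the read-access rules described in this document can be summarised in a few lines. This is a sketch with hypothetical function names; it covers read and share-management checks only.

```python
def can_read(owner_id: str, user_id: str,
             user_groups: frozenset[str], shared_group_ids: frozenset[str]) -> bool:
    """Owner, watch-ingested documents, or any overlap between the user's groups and the shares."""
    if owner_id == user_id:
        return True
    if owner_id == "watch":              # watch-ingested docs are visible to all users
        return True
    return bool(user_groups & shared_group_ids)


def can_manage_shares(owner_id: str, user_id: str) -> bool:
    """Listing, adding, and removing shares is owner-only."""
    return owner_id == user_id
```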

Database migrations

Revision  Description
0001      Initial schema (documents, categories, category_assignments)
0002      Add title column to documents
0003      Add source, watch_path, suggested_folder, suggested_filename columns
0004      Add document_shares table (document_id, group_id, shared_by_user_id, created_at)

Run automatically on container start via alembic upgrade head.


Architecture

backend (proxy)  →  doc-service:8001
                        │
           ┌────────────┼────────────────────┐
      documents.py    categories.py        plugin.py
           │               │             (internal only)
  ┌────────┴────────┐
upload           list/get/patch/suggest
  │
save_upload()        pdfplumber extraction
  │                    │
Document(status=pending)   ai_client.classify_document()
  │                    │
BackgroundTask         ai-service:8010/chat
  │                    │
process_document()   JSON result → update doc row

file_watcher.py (watchdog Observer, daemon thread)
  │
  ├── _PdfEventHandler.on_created / on_moved
  │       └── asyncio.run_coroutine_threadsafe(ingest_file, loop)
  │
  └── _scan_existing() on startup (catches offline gaps)
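
The hand-off from the watcher thread to the event loop can be illustrated without watchdog. In this sketch a plain thread plays the role of the Observer, and `ingest_file` only records the path; both are simplified stand-ins for the real handler and ingestion coroutine.

```python
import asyncio
import threading


async def ingest_file(path: str, sink: list[str]) -> None:
    """Stand-in for the real ingestion coroutine; it only records the path."""
    sink.append(path)


def watcher_thread(loop: asyncio.AbstractEventLoop, paths: list[str], sink: list[str]) -> None:
    """Runs off the event loop, like a watchdog callback, and submits work to it."""
    for path in paths:
        future = asyncio.run_coroutine_threadsafe(ingest_file(path, sink), loop)
        future.result(timeout=5)         # block this thread until the loop has run the coroutine


async def main(paths: list[str]) -> list[str]:
    sink: list[str] = []
    loop = asyncio.get_running_loop()
    thread = threading.Thread(target=watcher_thread, args=(loop, paths, sink), daemon=True)
    thread.start()
    await asyncio.to_thread(thread.join)  # wait for the thread without blocking the loop
    return sink
```

Calling `asyncio.run(main(["/data/watch/a.pdf"]))` runs the coroutine on the loop even though it was submitted from another thread, which is the property the watcher relies on.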

Known limitations / not implemented

  • Re-process — resolved: POST /documents/{id}/reprocess re-triggers AI extraction on an existing document (e.g. after changing the AI model or prompt)
  • Advanced field-level search — the search param matches text fields via ILIKE but does not query into extracted_data JSON (e.g. filter by vendor or due_date)
  • Bulk operations — no bulk category assign/remove endpoint (frontend handles bulk delete/share individually)
  • Pagination in categories — categories are returned as a full list (no pagination)
  • File type — only PDF supported
  • Concurrent uploads — no rate limiting per user

Future work

  • POST /documents/{id}/reprocess — re-run AI extraction (done)
  • Watch directory feature with file watcher, startup scan, folder-to-category mapping, AI suggestion toggles (done)
  • Plugin endpoints (/plugin/manifest, /plugin/settings) for generic settings UI (done)
  • Advanced filter: query extracted_data JSON fields (vendor, due_date, amount) — requires PostgreSQL jsonb column or indexed virtual columns
  • Bulk operations endpoint
  • Document sharing via groups — document_shares table + share endpoints + shared-with-me view (done)
  • Frontend UI for suggestion badges (suggested_folder / suggested_filename confirm/reject buttons in slide-over) (done)
  • Support additional file types (images via OCR, DOCX)
  • Rate limiting on upload endpoint
  • Soft delete with restore
  • Edit rights for shared recipients (V2)