curo1305 00466a9801 Add generic plugin architecture and watch-directory feature

Introduces a manifest contract so feature containers self-describe their
settings (JSON Schema + access rules). Backend and frontend gain generic
plugin proxy and dynamic Extensions UI with zero feature-specific code.

Doc-service is the first plugin consumer: exposes /plugin/manifest and
/plugin/settings, adds a watchdog-based file watcher that auto-ingests
PDFs from a mounted directory, maps subfolders to categories, supports
AI-suggested folder/filename (user-confirmed), and enforces a no-remove
policy. Access is gated by is_superuser or doc-service-admin group.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-18 02:09:50 +02:00

9.5 KiB


Doc Service — Status

What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis. Also supports automatic ingestion from a mounted watch directory (NAS, Nextcloud, Syncthing, etc.).

Port: 8001 (internal only, not exposed to host). All traffic arrives via the backend proxy (backend/app/routers/documents_proxy.py), which injects the authenticated x-user-id header.

Database: shared PostgreSQL instance; migrations are isolated through a dedicated Alembic version table, alembic_version_doc_service.

Storage: /data/documents/ (Docker named volume doc_data).

Watch directory: /data/watch (named volume watch_data in prod; bind mount in dev via docker-compose.dev.yml).

Current functionality

Document lifecycle

POST /documents/upload — validate PDF, persist file to /data/documents/{user_id}/{doc_id}.pdf, create DB row with status=pending, enqueue background extraction
Background task: extract text with pdfplumber → POST to ai-service /chat → parse JSON result → update status=done (or failed)
AI extracts: title, document_type, tags, suggested_categories, plus domain fields (vendor, customer, dates, amounts, etc.) into extracted_data (JSON string)
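
The upload step above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the magic-header check is an assumption about what "validate PDF" means, and `looks_like_pdf`, `storage_path`, and `new_document_row` are hypothetical helper names.

```python
import uuid
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # assumption: validation checks the PDF magic header
STORAGE_ROOT = Path("/data/documents")


def looks_like_pdf(data: bytes, max_bytes: int = 20 * 1024 * 1024) -> bool:
    """Cheap sanity check: non-empty, under the size cap, starts with %PDF-."""
    return 0 < len(data) <= max_bytes and data.startswith(PDF_MAGIC)


def storage_path(user_id: str, doc_id: uuid.UUID, root: Path = STORAGE_ROOT) -> Path:
    """Build {root}/{user_id}/{doc_id}.pdf, the layout described above."""
    return root / user_id / f"{doc_id}.pdf"


def new_document_row(user_id: str, filename: str, data: bytes) -> dict:
    """Return the initial row for an upload; extraction would be enqueued next."""
    if not looks_like_pdf(data):
        raise ValueError("not a valid PDF upload")
    doc_id = uuid.uuid4()
    return {
        "id": doc_id,
        "user_id": user_id,
        "filename": filename,
        "file_size": len(data),
        "status": "pending",  # the background task later sets done or failed
        "path": storage_path(user_id, doc_id),
    }
```

A real handler would also write the bytes to `path` before returning the 202 response.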

Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters, sort, and optional `category_id` filter |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories (user + watch) |
| `POST` | `/documents/categories` | Create a category; triggers re-analysis of documents in similar categories |
| `POST` | `/documents/{id}/reprocess` | Reset status to pending and re-run AI extraction; 409 if already pending/processing |
| `PATCH` | `/documents/categories/{id}` | Rename a category |
| `DELETE` | `/documents/categories/{id}` | Delete a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
| `POST` | `/documents/{id}/suggestions/folder/confirm` | Apply AI folder suggestion → create/find category + assign |
| `POST` | `/documents/{id}/suggestions/folder/reject` | Clear AI folder suggestion |
| `POST` | `/documents/{id}/suggestions/filename/confirm` | Apply AI filename suggestion → set title |
| `POST` | `/documents/{id}/suggestions/filename/reject` | Clear AI filename suggestion |
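
The "dedup, preserve order" behaviour noted for `PATCH /documents/{id}/tags` could look something like the sketch below. `dedup_tags` is a hypothetical name, and the trimming, blank-dropping, and case-sensitive comparison are assumptions rather than documented behaviour.

```python
def dedup_tags(tags: list[str]) -> list[str]:
    """Drop repeated tags while keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for tag in tags:
        cleaned = tag.strip()
        # Assumption: blank tags are dropped and comparison is case-sensitive.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

A set alone would lose the user's ordering, which is why the sketch keeps a separate result list.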

Plugin endpoints (internal — backend calls only)

| Method | Path | Description |
|---|---|---|
| `GET` | `/plugin/manifest` | Static manifest: metadata, JSON Schema for settings, access rules |
| `GET` | `/plugin/settings` | Current watch/storage config values |
| `PATCH` | `/plugin/settings` | Update watch/storage config (persisted to `/config/doc_service_config.json`) |

Pagination & filtering (`GET /documents`)

Query params:

| Param | Default | Notes |
|---|---|---|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | (none) | filter by status string |
| `document_type` | (none) | filter by document type |
| `search` | (none) | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |
| `category_id` | (none) | filter to documents assigned to this category UUID |

Response: { items: [...], total: N, page: N, pages: N }
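
As an illustration of how the `page` / `per_page` bounds and the `pages` field relate, here is a small sketch. `paginate` is a hypothetical helper; the `offset` and `limit` keys are internal query values, not part of the response, and reporting 0 pages for an empty result is an assumption.

```python
import math


def paginate(total: int, page: int = 1, per_page: int = 20) -> dict:
    """Validate paging params and derive page count plus SQL offset/limit."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    pages = math.ceil(total / per_page)  # assumption: 0 pages when total is 0
    return {
        "total": total,
        "page": page,
        "pages": pages,
        "offset": (page - 1) * per_page,  # rows to skip in the query
        "limit": per_page,
    }
```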

Document schema

id                UUID
user_id           string (from x-user-id header; "watch" for watch-ingested docs)
filename          original filename
title             AI-suggested editable title (nullable)
file_size         bytes
status            pending | processing | done | failed
document_type     AI-classified type (nullable)
extracted_data    JSON string — all AI-extracted fields
tags              JSON array string — editable tags
error_message     set if status=failed
created_at        upload timestamp
processed_at      when extraction finished
source            "upload" (default) or "watch"
watch_path        original absolute path in watch directory (nullable)
suggested_folder  AI-suggested category name, pending user confirm (nullable)
suggested_filename AI-suggested title/rename, pending user confirm (nullable)
categories        many-to-many via category_assignments

Watch-ingested documents (user_id = "watch") are visible to all authenticated users.
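
The visibility rule can be expressed as a one-line predicate. This is only a sketch of the rule as stated here; `can_view` and `visible_documents` are hypothetical names, and the real service presumably applies the same condition inside its database query.

```python
WATCH_USER_ID = "watch"  # user_id assigned to watch-ingested documents


def can_view(doc_user_id: str, requester_id: str) -> bool:
    """Owners see their own documents; everyone sees watch-ingested ones."""
    return doc_user_id == requester_id or doc_user_id == WATCH_USER_ID


def visible_documents(docs: list[dict], requester_id: str) -> list[dict]:
    """Filter a list of document rows down to what the requester may see."""
    return [d for d in docs if can_view(d["user_id"], requester_id)]
```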

AI extraction (via ai-service)

System prompt and user prompt template are loaded at runtime from doc_service_config.json (system_prompts key). Defaults are built into the service and used as fallback if the config key is absent. Changes made via the AI Settings UI take effect within 30 seconds (config cache TTL).
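
A 30-second config cache with built-in fallbacks might be structured like the sketch below. `CachedConfig` and `prompt_or_default` are hypothetical names, and the assumption that `system_prompts` maps prompt names to strings is not confirmed by this document.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


class CachedConfig:
    """Re-reads a JSON config file at most once per TTL window."""

    def __init__(self, path: Path, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._path = Path(path)
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._loaded_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now - self._loaded_at >= self._ttl:
            self._value = self._read()
            self._loaded_at = now
        return self._value

    def _read(self) -> dict:
        try:
            return json.loads(self._path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            return {}  # missing or corrupt file: callers fall back to defaults


def prompt_or_default(cfg: CachedConfig, name: str, default: str) -> str:
    # Assumption: "system_prompts" maps prompt names to prompt strings.
    return cfg.get().get("system_prompts", {}).get(name, default)
```

Injecting the clock keeps the TTL behaviour testable without sleeping.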

The prompt includes the first 50,000 characters of the extracted text. The expected JSON response includes:

title — suggested human-readable title
document_type — invoice / bill / receipt / order / expense / revenue / unknown
tags — list of keyword tags
suggested_categories — list of category names to suggest in the UI
Domain fields: vendor, customer, invoice_number, due_date, total_amount, currency, etc.
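
The truncation and response handling described above could be sketched as follows. This is an illustration under stated assumptions: `truncate_for_prompt` and `parse_ai_result` are hypothetical names, and coercing an unrecognised `document_type` to `unknown` is a guess at sensible behaviour, not documented.

```python
import json

MAX_PROMPT_CHARS = 50_000
KNOWN_TYPES = {"invoice", "bill", "receipt", "order", "expense", "revenue", "unknown"}


def truncate_for_prompt(text: str) -> str:
    """Keep only the first 50,000 characters of extracted text."""
    return text[:MAX_PROMPT_CHARS]


def _string_list(value: object) -> list[str]:
    """Return only the string items of a list; anything else becomes []."""
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, str)]


def parse_ai_result(raw: str) -> dict:
    """Parse the model's JSON reply into the fields stored on the document."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    doc_type = data.get("document_type")
    doc_type = doc_type.strip().lower() if isinstance(doc_type, str) else "unknown"
    if doc_type not in KNOWN_TYPES:
        doc_type = "unknown"  # assumption: unrecognised types are coerced
    return {
        "title": data.get("title"),
        "document_type": doc_type,
        "tags": _string_list(data.get("tags")),
        "suggested_categories": _string_list(data.get("suggested_categories")),
        "extracted_data": json.dumps(data),  # stored as a JSON string
    }
```

A `json.JSONDecodeError` or `ValueError` raised here would correspond to the `failed` status with an `error_message`.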

Config (runtime, persisted to shared volume)

/config/doc_service_config.json:

{ "documents": { "max_pdf_bytes": 20971520 } }

Env override: DOC_MAX_PDF_MB
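
One plausible way to resolve the upload size limit is sketched below. The precedence (environment variable first, then the config file, then a built-in default) is an assumption based on the word "override", and `max_pdf_bytes` is a hypothetical function name.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520, matching the sample config


def max_pdf_bytes(config_path: Path, env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve the PDF size cap in bytes."""
    env = os.environ if env is None else env
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        # Assumption: the env var is expressed in MiB and wins over the file.
        return int(float(override) * 1024 * 1024)
    try:
        cfg = json.loads(Path(config_path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return DEFAULT_MAX_PDF_BYTES
    return int(cfg.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
```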

Watch directory feature

Controlled via plugin settings (UI accessible to superusers and doc-service-admin group members):

watch_enabled — toggle file watching (default: false)
watch_path — mount point (read-only, /data/watch; override via Docker volume)
ai_folder_suggestion — AI suggests a category for each ingested doc (user confirms)
ai_folder_default — default category when AI suggestion is disabled
ai_rename_suggestion — AI suggests a title for each ingested doc (user confirms)
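
The access rule and the read-only `watch_path` setting could be enforced along these lines. This is a hedged sketch: `User`, `can_manage_plugin`, and `apply_settings_patch` are hypothetical, and every default except `watch_enabled` and `watch_path` is a placeholder rather than a documented value.

```python
from dataclasses import dataclass

ADMIN_GROUP = "doc-service-admin"
READ_ONLY_KEYS = {"watch_path"}

DEFAULT_SETTINGS = {
    "watch_enabled": False,         # documented default
    "watch_path": "/data/watch",    # documented mount point
    "ai_folder_suggestion": False,  # placeholder default (assumption)
    "ai_folder_default": "",        # placeholder default (assumption)
    "ai_rename_suggestion": False,  # placeholder default (assumption)
}


@dataclass(frozen=True)
class User:
    id: str
    is_superuser: bool = False
    groups: frozenset = frozenset()


def can_manage_plugin(user: User) -> bool:
    """Superusers and doc-service-admin members may change plugin settings."""
    return user.is_superuser or ADMIN_GROUP in user.groups


def apply_settings_patch(current: dict, patch: dict, user: User) -> dict:
    """Return a new settings dict, rejecting unauthorised or invalid patches."""
    if not can_manage_plugin(user):
        raise PermissionError("plugin settings require admin access")
    unknown = set(patch) - set(current)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    if READ_ONLY_KEYS & set(patch):
        raise ValueError("watch_path is read-only; change the Docker volume instead")
    return {**current, **patch}
```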

On startup, the watcher scans the watch directory and ingests any PDFs not already in the database (idempotency check by watch_path). While running, it picks up newly created or moved-in PDFs through watchdog events. Subfolders are automatically mapped to categories (e.g. watch/invoices/bill.pdf → category "invoices"). No-remove policy: deleting a file from the watch directory does not delete the document record.
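
The startup scan and folder-to-category mapping might be approximated as below. `category_for` and `scan_new_pdfs` are hypothetical names, and two details are assumptions: only the first subfolder level names the category, and files directly in the watch root get no category.

```python
from pathlib import Path
from typing import Optional


def category_for(watch_root: Path, file_path: Path) -> Optional[str]:
    """Map watch/invoices/bill.pdf to the category name 'invoices'."""
    rel = file_path.relative_to(watch_root)
    # Assumption: first subfolder names the category; root-level files have none.
    return rel.parts[0] if len(rel.parts) > 1 else None


def scan_new_pdfs(watch_root: Path, known_paths: set) -> list:
    """List (path, category) pairs for PDFs not yet recorded by watch_path."""
    found = []
    for path in sorted(watch_root.rglob("*")):
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        key = str(path)
        if key in known_paths:  # idempotency: already ingested, skip
            continue
        found.append((key, category_for(watch_root, path)))
    return found
```

Because the scan only ever adds rows for unseen paths, a file that disappears from the directory is simply never revisited, which is consistent with the no-remove policy.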

Database migrations

| Revision | Description |
|---|---|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |
| 0003 | Add `source`, `watch_path`, `suggested_folder`, `suggested_filename` columns |

Run automatically on container start via alembic upgrade head.

Architecture

backend (proxy)  →  doc-service:8001
                        │
           ┌────────────┼────────────────────┐
      documents.py    categories.py        plugin.py
           │               │             (internal only)
  ┌────────┴────────┐
upload           list/get/patch/suggest
  │
save_upload()        pdfplumber extraction
  │                    │
Document(status=pending)   ai_client.classify_document()
  │                    │
BackgroundTask         ai-service:8010/chat
  │                    │
process_document()   JSON result → update doc row

file_watcher.py (watchdog Observer, daemon thread)
  │
  ├── _PdfEventHandler.on_created / on_moved
  │       └── asyncio.run_coroutine_threadsafe(ingest_file, loop)
  │
  └── _scan_existing() on startup (catches offline gaps)
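
The hand-off from the watcher thread to the async event loop, shown in the diagram as `asyncio.run_coroutine_threadsafe(ingest_file, loop)`, can be demonstrated with the standard library alone. In this sketch a plain thread stands in for the watchdog Observer, and `ingest_file` is a stub with a made-up signature; only the bridging pattern is the point.

```python
import asyncio
import threading


async def ingest_file(path: str, ingested: list) -> None:
    """Stub for the real ingestion coroutine; runs on the event loop."""
    await asyncio.sleep(0)
    ingested.append(path)


def watcher_thread(loop: asyncio.AbstractEventLoop, paths: list, ingested: list) -> None:
    """Stand-in for a watchdog handler: runs off-loop, schedules work on it."""
    for path in paths:
        future = asyncio.run_coroutine_threadsafe(ingest_file(path, ingested), loop)
        future.result(timeout=5)  # block this thread, not the loop


async def main(paths: list) -> list:
    ingested: list = []
    loop = asyncio.get_running_loop()
    thread = threading.Thread(
        target=watcher_thread, args=(loop, paths, ingested), daemon=True
    )
    thread.start()
    # Join in an executor so the loop stays free to run the scheduled coroutines.
    await loop.run_in_executor(None, thread.join)
    return ingested


def run_demo(paths: list) -> list:
    return asyncio.run(main(paths))
```

Calling `thread.join()` directly inside `main` would block the loop and deadlock, since the watcher thread is waiting on work that only the loop can run.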

Known limitations / not implemented

Re-process: single documents can be re-run through POST /documents/{id}/reprocess (e.g. after changing the AI model or prompt), but there is no bulk variant
Advanced field-level search — search param matches text fields via ILIKE but does not query into extracted_data JSON (e.g. filter by vendor or due_date)
Bulk operations — no bulk category assign/remove, no bulk delete
Document sharing: uploaded documents are strictly per-user (only watch-ingested documents are visible to all authenticated users); no group sharing yet
Pagination in categories — categories are returned as a full list (no pagination)
File type — only PDF supported
Concurrent uploads — no rate limiting per user

Future work

POST /documents/{id}/reprocess: re-run AI extraction (implemented, see Endpoints)
Watch directory feature with file watcher, startup scan, folder-to-category mapping, AI suggestion toggles (implemented, see Watch directory feature)
Plugin manifest endpoint (/plugin/manifest, /plugin/settings) for generic settings UI (implemented, see Plugin endpoints)
Advanced filter: query extracted_data JSON fields (vendor, due_date, amount) — requires PostgreSQL jsonb column or indexed virtual columns
Bulk operations endpoint
Document sharing via groups (blocked on groups/permissions system in backend)
Support additional file types (images via OCR, DOCX)
Rate limiting on upload endpoint
Soft delete with restore
Category rename / delete with cascade handling
Frontend UI for suggestion badges (suggested_folder / suggested_filename confirm/reject buttons)
