# Doc Service — Status

## What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis. Also supports automatic ingestion from a mounted watch directory (NAS, Nextcloud, Syncthing, etc.).

Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.

Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table.

Storage: `/data/documents/` (Docker named volume `doc_data`).

Watch directory: `/data/watch` (named volume `watch_data` in prod; bind-mount in dev via `docker-compose.dev.yml`).

---

## Current functionality

### Document lifecycle

1. `POST /documents/upload` — validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)

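
The lifecycle above can be read as a storage-path convention plus a small status state machine. The sketch below is illustrative only: `storage_path`, `advance`, and the `ALLOWED` transition table are hypothetical names, and the exact set of transitions the service permits is an assumption inferred from the statuses and the reprocess endpoint described in this document.

```python
from enum import Enum
from pathlib import PurePosixPath
from uuid import UUID

# Storage root as documented above.
DOCUMENTS_ROOT = PurePosixPath("/data/documents")


class Status(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    DONE = "done"
    FAILED = "failed"


# Assumed transition table; the real service may allow other moves.
ALLOWED = {
    Status.PENDING: {Status.PROCESSING},
    Status.PROCESSING: {Status.DONE, Status.FAILED},
    Status.DONE: {Status.PENDING},    # reprocess resets to pending
    Status.FAILED: {Status.PENDING},  # reprocess resets to pending
}


def storage_path(user_id: str, doc_id: UUID) -> PurePosixPath:
    """Build /data/documents/{user_id}/{doc_id}.pdf for an uploaded file."""
    return DOCUMENTS_ROOT / user_id / f"{doc_id}.pdf"


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting a move out of `pending` or `processing` back to `pending` mirrors the 409 that the reprocess endpoint returns while extraction is already queued or running.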
### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters, sort, and optional `category_id` filter |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories (user + watch) |
| `POST` | `/documents/categories` | Create a category; triggers re-analysis of documents in similar categories |
| `POST` | `/documents/{id}/reprocess` | Reset status to pending and re-run AI extraction; 409 if already pending/processing |
| `PATCH` | `/documents/categories/{id}` | Rename a category |
| `DELETE` | `/documents/categories/{id}` | Delete a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
| `POST` | `/documents/{id}/suggestions/folder/confirm` | Apply AI folder suggestion → create/find category + assign |
| `POST` | `/documents/{id}/suggestions/folder/reject` | Clear AI folder suggestion |
| `POST` | `/documents/{id}/suggestions/filename/confirm` | Apply AI filename suggestion → set title |
| `POST` | `/documents/{id}/suggestions/filename/reject` | Clear AI filename suggestion |

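
The "dedup, preserve order" behaviour of `PATCH /documents/{id}/tags` could look roughly like the following. This is a minimal sketch: `normalize_tags` is a hypothetical helper, and trimming whitespace, dropping empty strings, and treating tags as case-sensitive are all assumptions not stated in this document.

```python
def normalize_tags(tags: list[str]) -> list[str]:
    """Drop duplicate tags while keeping first-seen order.

    Assumptions: surrounding whitespace is trimmed, empty tags are
    discarded, and comparison is case-sensitive.
    """
    cleaned = (tag.strip() for tag in tags)
    # dict preserves insertion order, so fromkeys() keeps the first
    # occurrence of each tag and silently ignores later repeats.
    return list(dict.fromkeys(tag for tag in cleaned if tag))
```

For example, `normalize_tags(["invoice", "2024", "invoice"])` keeps `"invoice"` in its original first position rather than moving it to the end.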
### Plugin endpoints (internal — backend calls only)

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/plugin/manifest` | Static manifest: metadata, JSON Schema for settings, access rules |
| `GET` | `/plugin/settings` | Current watch/storage config values |
| `PATCH` | `/plugin/settings` | Update watch/storage config (persisted to `/config/doc_service_config.json`) |

### Pagination & filtering (`GET /documents`)

Query params:

| Param | Default | Notes |
|-------|---------|-------|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | — | filter by status string |
| `document_type` | — | filter by document type |
| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |
| `category_id` | — | filter to documents assigned to this category UUID |

Response: `{ items: [...], total: N, page: N, pages: N }`

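
To make the parameter bounds and the response envelope concrete, here is a small sketch of how they might be validated and assembled. `ListParams` and `envelope` are hypothetical names, and computing `pages` as a ceiling division (so an empty result reports `pages: 0`) is an assumption about the service's behaviour.

```python
import math
from dataclasses import dataclass

# Sort keys taken from the table above.
ALLOWED_SORT = {
    "created_at", "processed_at", "filename", "title",
    "file_size", "status", "document_type",
}


@dataclass(frozen=True)
class ListParams:
    page: int = 1
    per_page: int = 20
    sort: str = "created_at"
    order: str = "desc"

    def __post_init__(self) -> None:
        if self.page < 1:
            raise ValueError("page must be >= 1")
        if not 1 <= self.per_page <= 100:
            raise ValueError("per_page must be between 1 and 100")
        if self.sort not in ALLOWED_SORT:
            raise ValueError(f"unsupported sort key: {self.sort}")
        if self.order not in {"asc", "desc"}:
            raise ValueError("order must be 'asc' or 'desc'")

    @property
    def offset(self) -> int:
        """Number of rows to skip for the requested page."""
        return (self.page - 1) * self.per_page


def envelope(items: list, total: int, params: ListParams) -> dict:
    """Wrap one page of results in the documented response shape."""
    pages = math.ceil(total / params.per_page)  # assumption: 0 pages when empty
    return {"items": items, "total": total, "page": params.page, "pages": pages}
```

With the defaults, 45 matching documents yield three pages, and page 3 starts at offset 40.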
### Document schema

```
id                  UUID
user_id             string (from x-user-id header; "watch" for watch-ingested docs)
filename            original filename
title               AI-suggested editable title (nullable)
file_size           bytes
status              pending | processing | done | failed
document_type       AI-classified type (nullable)
extracted_data      JSON string — all AI-extracted fields
tags                JSON array string — editable tags
error_message       set if status=failed
created_at          upload timestamp
processed_at        when extraction finished
source              "upload" (default) or "watch"
watch_path          original absolute path in watch directory (nullable)
suggested_folder    AI-suggested category name, pending user confirm (nullable)
suggested_filename  AI-suggested title/rename, pending user confirm (nullable)
categories          many-to-many via category_assignments
```

Watch-ingested documents (`user_id = "watch"`) are visible to all authenticated users.

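
Two details of the schema are easy to miss: `tags` and `extracted_data` are stored as JSON strings, and ownership doubles as the visibility rule. The sketch below illustrates both. `DocumentRow` and `visible_to` are hypothetical names, only a few columns are modelled, and group sharing is deliberately left out of the check.

```python
import json
from dataclasses import dataclass

WATCH_USER_ID = "watch"  # owner value used for watch-ingested documents


@dataclass
class DocumentRow:
    """A cut-down stand-in for the documents table row."""
    user_id: str
    tags: str = "[]"            # JSON array string
    extracted_data: str = "{}"  # JSON object string

    def tag_list(self) -> list:
        """Decode the stored JSON array into a Python list."""
        return json.loads(self.tags)

    def fields(self) -> dict:
        """Decode the AI-extracted fields into a dict."""
        return json.loads(self.extracted_data)


def visible_to(doc: DocumentRow, requester_id: str) -> bool:
    """Owner sees their own docs; everyone sees watch-ingested docs."""
    return doc.user_id == requester_id or doc.user_id == WATCH_USER_ID
```

Callers that need a single extracted value, such as `vendor`, decode the string first and then read the key from the resulting dict.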
### AI extraction (via ai-service)

System prompt and user prompt template are loaded at runtime from `doc_service_config.json` (`system_prompts` key). Defaults are built into the service and used as fallback if the config key is absent. Changes made via the AI Settings UI take effect within 30 seconds (config cache TTL).

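
A 30-second config cache with a built-in fallback could be structured along these lines. This is a sketch under assumptions: `ConfigCache`, `load_prompts`, and `DEFAULT_PROMPTS` are hypothetical, the shape of the `system_prompts` value is not specified in this document, and the injectable `clock` exists only to make the expiry logic easy to test.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional

# Placeholder defaults; the real built-in prompts are not shown here.
DEFAULT_PROMPTS = {"system": "<built-in system prompt>"}


class ConfigCache:
    """Re-read a JSON config file at most once per TTL window."""

    def __init__(self, path: Path, ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._path = path
        self._ttl = ttl
        self._clock = clock
        self._value: dict = {}
        self._loaded_at: Optional[float] = None

    def get(self) -> dict:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            self._value = self._read()
            self._loaded_at = now
        return self._value

    def _read(self) -> dict:
        try:
            return json.loads(self._path.read_text(encoding="utf-8"))
        except FileNotFoundError:
            return {}  # no config file yet: behave as if every key is absent


def load_prompts(cache: ConfigCache) -> dict:
    """Return configured prompts, falling back to the built-in defaults."""
    return cache.get().get("system_prompts", DEFAULT_PROMPTS)
```

Because the file is re-read only after the TTL has elapsed, an edit made through the settings UI becomes visible on the first request at least 30 seconds after the previous load.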
Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

- `title` — suggested human-readable title
- `document_type` — invoice / bill / receipt / order / expense / revenue / unknown
- `tags` — list of keyword tags
- `suggested_categories` — list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.

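
The truncation step and the handling of the model's JSON reply might be sketched as follows. `truncate_for_prompt` and `parse_ai_result` are hypothetical names. Mapping an unrecognised `document_type` to `unknown` and defaulting missing lists to empty are assumptions, not documented behaviour.

```python
import json

MAX_PROMPT_CHARS = 50_000
KNOWN_TYPES = {"invoice", "bill", "receipt", "order", "expense", "revenue", "unknown"}


def truncate_for_prompt(extracted_text: str) -> str:
    """Keep only the first 50 000 characters of the PDF text."""
    return extracted_text[:MAX_PROMPT_CHARS]


def parse_ai_result(raw: str) -> dict:
    """Turn the model's JSON reply into column values for the document row."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object from the AI service")

    doc_type = data.get("document_type")
    if doc_type not in KNOWN_TYPES:
        doc_type = "unknown"  # assumption: unrecognised types are not stored

    return {
        "title": data.get("title"),
        "document_type": doc_type,
        "tags": list(data.get("tags") or []),
        "suggested_categories": list(data.get("suggested_categories") or []),
        # The full reply, domain fields included, is kept as a JSON string.
        "extracted_data": json.dumps(data),
    }
```

A reply that is not valid JSON raises `json.JSONDecodeError` (a `ValueError` subclass), which a caller could catch to set `status=failed` and fill `error_message`.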
### Config (runtime, persisted to shared volume)

`/config/doc_service_config.json`:

```json
{ "documents": { "max_pdf_bytes": 20971520 } }
```

Env override: `DOC_MAX_PDF_MB`

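
One plausible way to resolve the upload size limit from these two sources is shown below. It is a sketch, not the service's code: `max_pdf_bytes` as a function is hypothetical, and it assumes that the environment variable wins over the file, that `DOC_MAX_PDF_MB` is expressed in mebibytes, and that 20971520 bytes is the fallback when neither source is set.

```python
import json
import os
from pathlib import Path
from typing import Mapping

DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520, the value shown above


def max_pdf_bytes(config_path: Path, env: Mapping[str, str] = os.environ) -> int:
    """Resolve the PDF size limit: env override, then config file, then default."""
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        # Assumption: the override is a whole number of MiB.
        return int(override) * 1024 * 1024

    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return DEFAULT_MAX_PDF_BYTES

    return int(config.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
```

Passing `env` explicitly keeps the function easy to exercise without mutating the process environment.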
### Watch directory feature

Controlled via plugin settings (UI accessible to superusers and `doc-service-admin` group members):

- `watch_enabled` — toggle file watching (default: false)
- `watch_path` — mount point (read-only, `/data/watch`; override via Docker volume)
- `ai_folder_suggestion` — AI suggests a category for each ingested doc (user confirms)
- `ai_folder_default` — default category when AI suggestion is disabled
- `ai_rename_suggestion` — AI suggests a title for each ingested doc (user confirms)

During the startup scan, the watcher walks the watch directory and ingests any PDFs not already in the database (idempotency check by `watch_path`). Subfolders are automatically mapped to categories (e.g. `watch/invoices/bill.pdf` → category "invoices"). No-remove policy: deleting a file from the watch directory does not delete the document record.

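
The folder-to-category mapping and the idempotency check can be illustrated with a short sketch. `category_for` and `should_ingest` are hypothetical helpers. Using only the top-level subfolder as the category, and matching the `.pdf` extension case-insensitively, are assumptions about how nested folders and file names are treated.

```python
from pathlib import PurePosixPath
from typing import Optional


def category_for(watch_root: str, file_path: str) -> Optional[str]:
    """Map a file under the watch root to a category name.

    Files directly in the root get no category; files in a subfolder
    get the name of the top-level subfolder (assumption for nesting).
    """
    relative = PurePosixPath(file_path).relative_to(watch_root)
    parts = relative.parts
    return parts[0] if len(parts) > 1 else None


def should_ingest(file_path: str, known_watch_paths: set[str]) -> bool:
    """Ingest only PDFs whose watch_path is not already recorded."""
    is_pdf = file_path.lower().endswith(".pdf")
    return is_pdf and file_path not in known_watch_paths
```

Because the check is keyed on `watch_path`, rescanning the same directory after a restart skips every file that already has a document record.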
### Document sharing (`document_shares`)

Group-based sharing allows a document owner to share a document with all members of any group they belong to. Recipients can view and download the shared document; they cannot edit, re-analyse, delete, or re-share it.

The gateway injects `X-User-Groups: <group_id1>,<group_id2>,...` alongside the existing `X-User-Id` header, so doc-service can evaluate group access without querying the backend DB.

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `GET` | `/documents/shared-with-me` | Any user | Documents shared with the user via their groups; excludes own docs |
| `GET` | `/documents/{id}/shares` | Owner only | List all groups the document is shared with |
| `POST` | `/documents/{id}/shares` | Owner only | Share with a group (`{group_id}` in body); group must be in X-User-Groups |
| `DELETE` | `/documents/{id}/shares/{group_id}` | Owner only | Stop sharing with that group |

`DocumentOut` now includes `share_count: int` — the number of groups the document is shared with.

`GET /documents/{id}/download` also allows access to shared documents (recipients can download).

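
The access rules in this section reduce to a header parse and two set checks. The sketch below shows one way to express them. `parse_groups`, `can_share`, and `can_read` are hypothetical names, and tolerating whitespace and empty entries in the header is an assumption about the gateway's output.

```python
from typing import Iterable, Optional


def parse_groups(header_value: Optional[str]) -> frozenset:
    """Split an X-User-Groups header value into a set of group ids."""
    if not header_value:
        return frozenset()
    return frozenset(part.strip() for part in header_value.split(",") if part.strip())


def can_share(owner_id: str, requester_id: str, group_id: str,
              requester_groups: frozenset) -> bool:
    """Only the owner may share, and only with a group they belong to."""
    return requester_id == owner_id and group_id in requester_groups


def can_read(owner_id: str, requester_id: str, shared_group_ids: Iterable[str],
             requester_groups: frozenset) -> bool:
    """Owner always reads; others need at least one group in common."""
    if requester_id == owner_id:
        return True
    return not requester_groups.isdisjoint(shared_group_ids)
```

Everything the checks need arrives in the request headers and the `document_shares` rows, which is what lets doc-service decide without a call back to the backend.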
### Database migrations

| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |
| 0003 | Add `source`, `watch_path`, `suggested_folder`, `suggested_filename` columns |
| 0004 | Add `document_shares` table (document_id, group_id, shared_by_user_id, created_at) |

Run automatically on container start via `alembic upgrade head`.

---

## Architecture

```
backend (proxy) → doc-service:8001
                        │
           ┌────────────┼────────────────────┐
     documents.py categories.py          plugin.py
           │            │             (internal only)
  ┌────────┴────────┐
upload    list/get/patch/suggest
  │
save_upload()              pdfplumber extraction
  │                          │
Document(status=pending)   ai_client.classify_document()
  │                          │
BackgroundTask             ai-service:8010/chat
  │                          │
process_document()         JSON result → update doc row

file_watcher.py  (watchdog Observer, daemon thread)
  │
  ├── _PdfEventHandler.on_created / on_moved
  │     └── asyncio.run_coroutine_threadsafe(ingest_file, loop)
  │
  └── _scan_existing() on startup  (catches offline gaps)
```
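
The hand-off from the watcher thread to the event loop shown in the diagram relies on `asyncio.run_coroutine_threadsafe`. The sketch below demonstrates that mechanism in isolation, using a plain `threading.Thread` in place of the watchdog observer. The body of `ingest_file` is a stand-in, and blocking on `future.result()` is done here only so the example can collect a value; a real event handler would more likely return immediately.

```python
import asyncio
import threading


async def ingest_file(path: str) -> str:
    """Stand-in for the real ingestion coroutine."""
    await asyncio.sleep(0)  # yield to the loop, as real I/O would
    return f"ingested {path}"


def on_created(path: str, loop: asyncio.AbstractEventLoop) -> str:
    """Runs in the watcher thread; schedules work on the service's loop."""
    future = asyncio.run_coroutine_threadsafe(ingest_file(path), loop)
    # Waiting here is for demonstration only.
    return future.result(timeout=5)


async def main() -> list:
    loop = asyncio.get_running_loop()
    results: list = []
    worker = threading.Thread(
        target=lambda: results.append(on_created("/data/watch/invoices/bill.pdf", loop)),
        daemon=True,
    )
    worker.start()
    # Join in the default executor so the event loop stays free to run
    # the coroutine that the worker thread just scheduled.
    await loop.run_in_executor(None, worker.join)
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```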

---

## Known limitations / not implemented

- **Advanced field-level search** — `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- **Bulk operations** — no bulk category assign/remove endpoint (frontend handles bulk delete/share individually)
- **Pagination in categories** — categories are returned as a full list (no pagination)
- **File type** — only PDF supported
- **Concurrent uploads** — no rate limiting per user

---

## Future work

- [x] `POST /documents/{id}/reprocess` — re-run AI extraction
- [x] Watch directory feature with file watcher, startup scan, folder-to-category mapping, AI suggestion toggles
- [x] Plugin manifest endpoint (`/plugin/manifest`, `/plugin/settings`) for generic settings UI
- [x] Document sharing via groups — `document_shares` table + share endpoints + shared-with-me view
- [x] Frontend UI for suggestion badges (suggested_folder / suggested_filename confirm/reject buttons in slide-over)
- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount) — requires PostgreSQL `jsonb` column or indexed virtual columns
- [ ] Bulk operations endpoint
- [ ] Support additional file types (images via OCR, DOCX)
- [ ] Rate limiting on upload endpoint
- [ ] Soft delete with restore
- [ ] Edit rights for shared recipients (V2)