Doc Service — Status
What it is
PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.
Port: 8001 (internal only, not exposed to host). All traffic arrives via the backend proxy (backend/app/routers/documents_proxy.py), which injects the authenticated x-user-id header.
Database: shared PostgreSQL instance, isolated via alembic_version_doc_service Alembic version table. Storage: /data/documents/ (Docker named volume doc_data).
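To make the per-user storage layout concrete, the sketch below builds the on-disk path for one document. Only the `/data/documents/{user_id}/{doc_id}.pdf` layout comes from this document; the helper name `document_path` and its validation rules are assumptions for illustration, not the service's actual code.

```python
import uuid
from pathlib import PurePosixPath

STORAGE_ROOT = PurePosixPath("/data/documents")  # storage root named in this document


def document_path(user_id: str, doc_id: str) -> PurePosixPath:
    """Return the storage path for one user's document (hypothetical helper)."""
    # Assumption: reject user ids that could escape the per-user directory.
    if not user_id or "/" in user_id or user_id in {".", ".."}:
        raise ValueError(f"invalid user id: {user_id!r}")
    # Document ids are UUIDs; uuid.UUID raises ValueError for anything else.
    canonical = str(uuid.UUID(doc_id))
    return STORAGE_ROOT / user_id / f"{canonical}.pdf"
```

Validating both path components before touching the filesystem keeps one user's upload from landing in another user's directory, which matters because the `x-user-id` header is the only isolation boundary described here.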
Current functionality
Document lifecycle
- `POST /documents/upload`: validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
- Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
- AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)
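The lifecycle above can be sketched as a small state transition function. This is a minimal illustration, not the service's implementation: the `Document` dataclass is a stand-in for the real DB model, the `extract_text` and `classify` callables stand in for the `pdfplumber` step and the ai-service call, and setting `processed_at` on failure as well as success is an assumption.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Document:
    """Stand-in for the DB row; field names follow the document schema."""
    id: str
    status: str = "pending"
    extracted_data: Optional[str] = None
    error_message: Optional[str] = None
    processed_at: Optional[datetime] = None


def process_document(
    doc: Document,
    extract_text: Callable[[], str],
    classify: Callable[[str], str],
) -> Document:
    """Run extraction for one document and record the outcome on the row."""
    doc.status = "processing"
    try:
        text = extract_text()      # stands in for the pdfplumber step
        raw = classify(text)       # stands in for the POST to ai-service /chat
        json.loads(raw)            # fail early if the reply is not valid JSON
        doc.extracted_data = raw   # stored as a JSON string
        doc.status = "done"
    except Exception as exc:       # any failure marks the document as failed
        doc.error_message = str(exc)
        doc.status = "failed"
    # Assumption: the timestamp is recorded for both outcomes.
    doc.processed_at = datetime.now(timezone.utc)
    return doc
```

Passing the two steps in as callables keeps the sketch runnable without a PDF or a live ai-service, and makes the failure path easy to exercise.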
Endpoints
| Method | Path | Description |
|---|---|---|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters and sort |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
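Because upload returns 202 before extraction finishes, a client typically polls the status endpoint until the document settles. The sketch below shows one way to do that. The `fetch_status` callable is hypothetical: it is assumed to call `GET /documents/{id}/status` and return the status string, and the exact response shape of that endpoint is not specified here. The interval and attempt limit are also illustrative defaults.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"done", "failed"}


def wait_for_document(
    fetch_status: Callable[[], str],
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document reaches a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Skip the final sleep: there is no further attempt to wait for.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("document did not finish processing in time")
```

Injecting `sleep` lets tests run the loop instantly without patching the `time` module.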
Pagination & filtering (GET /documents)
Query params:
| Param | Default | Notes |
|---|---|---|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | created_at, processed_at, filename, title, file_size, status, document_type |
| `order` | `desc` | `asc` or `desc` |
| `status` | none | filter by status string |
| `document_type` | none | filter by document type |
| `search` | none | case-insensitive ILIKE on title, filename, tags, document_type |
Response: { items: [...], total: N, page: N, pages: N }
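The relationship between the query params and the response envelope can be illustrated with an in-memory version. This is a sketch only: the real service paginates in the database, and computing `pages` as `ceil(total / per_page)` (so an empty result has 0 pages) is an assumption about how the field is derived.

```python
import math
from typing import Any, Dict, List


def paginate(items: List[Any], page: int = 1, per_page: int = 20) -> Dict[str, Any]:
    """Slice a list into one page and build the documented response envelope."""
    # Bounds match the query param table: page >= 1, per_page in 1..100.
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    total = len(items)
    pages = math.ceil(total / per_page)  # assumption: 0 pages when total is 0
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "total": total,
        "page": page,
        "pages": pages,
    }
```

For example, 45 documents at the default `per_page` of 20 give three pages, with the last page holding five items.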
Document schema
| Field | Description |
|---|---|
| `id` | UUID |
| `user_id` | string (from `x-user-id` header) |
| `filename` | original filename |
| `title` | AI-suggested editable title (nullable) |
| `file_size` | bytes |
| `status` | `pending`, `processing`, `done`, or `failed` |
| `document_type` | AI-classified type (nullable) |
| `extracted_data` | JSON string holding all AI-extracted fields |
| `tags` | JSON array string of editable tags |
| `error_message` | set if `status=failed` |
| `created_at` | upload timestamp |
| `processed_at` | when extraction finished |
| `categories` | many-to-many via `category_assignments` |
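Since `tags` is stored as a JSON array string and the tags endpoint deduplicates while preserving order, the round trip might look like the sketch below. The helper names are hypothetical, and trimming whitespace, dropping empty tags, and treating tags as case-sensitive are assumptions rather than documented behaviour.

```python
import json
from typing import Iterable, List, Optional


def normalise_tags(tags: Iterable[str]) -> List[str]:
    """Drop duplicates while keeping the first occurrence of each tag."""
    # Assumption: surrounding whitespace is trimmed and empty tags are dropped.
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    # dict preserves insertion order, so fromkeys gives an ordered dedup.
    return list(dict.fromkeys(cleaned))


def encode_tags(tags: Iterable[str]) -> str:
    """Serialise tags to the JSON array string stored in the column."""
    return json.dumps(normalise_tags(tags))


def decode_tags(raw: Optional[str]) -> List[str]:
    """Parse the stored JSON array string; a missing value means no tags."""
    return json.loads(raw) if raw else []
```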
AI extraction (via ai-service)
Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:
- `title`: suggested human-readable title
- `document_type`: invoice / bill / receipt / order / expense / revenue / unknown
- `tags`: list of keyword tags
- `suggested_categories`: list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.
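A defensive way to handle the prompt limit and the model's reply is sketched below. The function names are hypothetical. Falling back to `unknown` for an unrecognised `document_type`, coercing non-list values to empty lists, and storing the whole reply in `extracted_data` are assumptions, not confirmed behaviour of the service.

```python
import json
from typing import Any, Dict, List

MAX_PROMPT_CHARS = 50_000  # limit stated in this document
DOCUMENT_TYPES = {"invoice", "bill", "receipt", "order", "expense", "revenue", "unknown"}


def truncate_for_prompt(text: str) -> str:
    """Keep only the leading characters that are sent to the model."""
    return text[:MAX_PROMPT_CHARS]


def _as_list(value: Any) -> List[Any]:
    """Return the value if it is a list, otherwise an empty list."""
    return list(value) if isinstance(value, list) else []


def parse_extraction(raw: str) -> Dict[str, Any]:
    """Parse the model's JSON reply into the fields stored on the document."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    doc_type = data.get("document_type")
    if doc_type not in DOCUMENT_TYPES:
        doc_type = "unknown"  # assumption: unrecognised types fall back to unknown
    return {
        "title": data.get("title"),
        "document_type": doc_type,
        "tags": _as_list(data.get("tags")),
        "suggested_categories": _as_list(data.get("suggested_categories")),
        # Assumption: the full reply, including domain fields, is kept as a JSON string.
        "extracted_data": json.dumps(data),
    }
```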
Config (runtime, persisted to shared volume)
/config/doc_service_config.json:
{ "documents": { "max_pdf_bytes": 20971520 } }
Env override: DOC_MAX_PDF_MB
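One plausible way to resolve the effective upload limit is sketched below. The function name, the precedence (environment variable over config file), the fallback default, and the reading of "MB" as 1024 × 1024 bytes are all assumptions; the last one is at least consistent with 20971520 bytes being exactly 20 × 1024 × 1024.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520, the value in the config shown above


def load_max_pdf_bytes(
    config_path: str = "/config/doc_service_config.json",
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Resolve the upload size limit in bytes (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: the env override wins over the persisted config file.
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        return int(float(override) * 1024 * 1024)
    path = Path(config_path)
    if path.is_file():
        config = json.loads(path.read_text())
        return int(config.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
    # Assumption: fall back to the default when neither source is present.
    return DEFAULT_MAX_PDF_BYTES
```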
Database migrations
| Revision | Description |
|---|---|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add title column to documents |
Run automatically on container start via alembic upgrade head.
Architecture
Request path: backend (proxy) → doc-service:8001 → documents.py router, which branches into upload and list/get/patch.

- Upload: save_upload() → Document(status=pending) → BackgroundTask → process_document()
- Extraction: pdfplumber extraction → ai_client.classify_document() → ai-service:8010/chat → JSON result → update doc row
Known limitations / not implemented
- Re-process: no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
- Advanced field-level search: the `search` param matches text fields via ILIKE but does not query into the `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- Bulk operations: no bulk category assign/remove, no bulk delete
- Document sharing: documents are strictly per-user; no group sharing yet
- Pagination in categories: categories are returned as a full list (no pagination)
- File type: only PDF supported
- Concurrent uploads: no rate limiting per user
Future work
- `POST /documents/{id}/reprocess`: re-run AI extraction
- Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount); requires a PostgreSQL `jsonb` column or indexed virtual columns
- Bulk operations endpoint
- Document sharing via groups (blocked on groups/permissions system in backend)
- Support additional file types (images via OCR, DOCX)
- Rate limiting on upload endpoint
- Soft delete with restore
- Category rename / delete with cascade handling