curo1305 c4f0c7ad49 Add priority queue to ai-service and STATUS.md workflow

- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-14 22:58:10 +02:00

Doc Service — Status

What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.

Port: 8001 (internal only, not exposed to host). All traffic arrives via the backend proxy (backend/app/routers/documents_proxy.py), which injects the authenticated x-user-id header.

Database: shared PostgreSQL instance; migration history is kept separate through a dedicated Alembic version table, alembic_version_doc_service. Storage: /data/documents/ (Docker named volume doc_data).

Current functionality

Document lifecycle

POST /documents/upload — validate PDF, persist file to /data/documents/{user_id}/{doc_id}.pdf, create DB row with status=pending, enqueue background extraction
Background task: extract text with pdfplumber → POST to ai-service /chat → parse JSON result → update status=done (or failed)
AI extracts: title, document_type, tags, suggested_categories, plus domain fields (vendor, customer, dates, amounts, etc.) into extracted_data (JSON string)
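
The background step above can be sketched as follows. This is a minimal illustration, not the service's actual code: the `DocRow` class and the injected `extract_text` and `classify` callables are hypothetical stand-ins for the DB row, the pdfplumber extraction, and the POST to ai-service /chat. Setting `processing` at the start of the task is also an assumption.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DocRow:
    """Hypothetical in-memory stand-in for the documents DB row."""
    id: str
    status: str = "pending"
    extracted_data: Optional[str] = None
    error_message: Optional[str] = None


def process_document(
    doc: DocRow,
    extract_text: Callable[[], str],
    classify: Callable[[str], str],
) -> DocRow:
    """Run extraction for one document and record the outcome on the row.

    extract_text: returns the PDF text (pdfplumber in the real service).
    classify: sends the text to the AI service and returns its raw JSON reply.
    """
    doc.status = "processing"  # assumption: set when the task starts
    try:
        text = extract_text()
        result = json.loads(classify(text))  # raises on malformed JSON
        if not isinstance(result, dict):
            raise ValueError("AI response is not a JSON object")
        doc.extracted_data = json.dumps(result)  # stored as a JSON string
        doc.status = "done"
    except Exception as exc:  # any failure marks the document as failed
        doc.status = "failed"
        doc.error_message = str(exc)
    return doc
```

Passing the two steps in as callables keeps the sketch runnable without pdfplumber or a live ai-service; the real task would presumably call them directly.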

Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters and sort |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
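
The "dedup, preserve order" behaviour of the tags endpoint could look like the sketch below. The helper name `dedupe_tags` is hypothetical, and the sketch assumes tags are compared as exact strings (no trimming or case folding), which this document does not confirm.

```python
from typing import Iterable, List


def dedupe_tags(tags: Iterable[str]) -> List[str]:
    """Drop repeated tags while keeping each tag's first position."""
    # dict preserves insertion order, so fromkeys() keeps the first of each tag.
    return list(dict.fromkeys(tags))
```

For example, `dedupe_tags(["invoice", "2026", "invoice", "acme"])` keeps the first `invoice` and drops the second.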

Pagination & filtering (`GET /documents`)

Query params:

| Param | Default | Notes |
|---|---|---|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | — | filter by status string |
| `document_type` | — | filter by document type |
| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |

Response: { items: [...], total: N, page: N, pages: N }
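
As a rough illustration of how the response envelope relates to `page` and `per_page`, here is a sketch over an in-memory sequence. The `paginate` helper is hypothetical, and it assumes `pages` is the ceiling of `total / per_page`; the real service would apply the limit and offset in SQL.

```python
import math
from typing import Any, Dict, List, Sequence


def paginate(rows: Sequence[Any], page: int = 1, per_page: int = 20) -> Dict[str, Any]:
    """Slice an already filtered and sorted sequence into one page."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    total = len(rows)
    # Assumption: pages is the ceiling of total / per_page (0 when empty).
    pages = math.ceil(total / per_page)
    start = (page - 1) * per_page
    items: List[Any] = list(rows[start:start + per_page])
    return {"items": items, "total": total, "page": page, "pages": pages}
```

With 45 rows and `per_page=20`, page 3 holds the last five rows and `pages` is 3.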

Document schema

id            UUID
user_id       string (from x-user-id header)
filename      original filename
title         AI-suggested editable title (nullable)
file_size     bytes
status        pending | processing | done | failed
document_type AI-classified type (nullable)
extracted_data JSON string — all AI-extracted fields
tags          JSON array string — editable tags
error_message set if status=failed
created_at    upload timestamp
processed_at  when extraction finished
categories    many-to-many via category_assignments
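
Because `extracted_data` and `tags` are stored as JSON strings, callers have to decode them before use. The dataclass below is a sketch of the schema above with two decoding helpers; it is not the service's ORM model, and the helper names `tag_list` and `extracted` are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional
from uuid import UUID


@dataclass
class Document:
    id: UUID
    user_id: str                           # taken from the x-user-id header
    filename: str
    file_size: int                         # bytes
    status: str = "pending"                # pending | processing | done | failed
    title: Optional[str] = None
    document_type: Optional[str] = None
    extracted_data: Optional[str] = None   # JSON string
    tags: str = "[]"                       # JSON array string
    error_message: Optional[str] = None
    created_at: Optional[datetime] = None
    processed_at: Optional[datetime] = None

    def tag_list(self) -> List[str]:
        """Decode the stored JSON array string into a Python list."""
        return json.loads(self.tags)

    def extracted(self) -> Dict[str, Any]:
        """Decode extracted_data, or return {} before extraction finishes."""
        return json.loads(self.extracted_data) if self.extracted_data else {}
```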

AI extraction (via ai-service)

Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

title — suggested human-readable title
document_type — invoice / bill / receipt / order / expense / revenue / unknown
tags — list of keyword tags
suggested_categories — list of category names to suggest in the UI
Domain fields: vendor, customer, invoice_number, due_date, total_amount, currency, etc.
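
The truncation and the expected response shape can be sketched as below. The function names, the fallback to `unknown` for unrecognised types, and the grouping of the remaining keys under `domain_fields` are all assumptions made for illustration; the service may validate differently.

```python
import json
from typing import Any, Dict, List

MAX_PROMPT_CHARS = 50_000
DOCUMENT_TYPES = {"invoice", "bill", "receipt", "order", "expense", "revenue", "unknown"}
CORE_KEYS = {"title", "document_type", "tags", "suggested_categories"}


def truncate_for_prompt(text: str) -> str:
    """Keep only the first 50 000 characters of the extracted text."""
    return text[:MAX_PROMPT_CHARS]


def _string_list(value: Any) -> List[str]:
    """Return value as a list of strings, or [] if it is not a list."""
    if not isinstance(value, list):
        return []
    return [str(item) for item in value]


def parse_ai_result(raw: str) -> Dict[str, Any]:
    """Parse the AI reply and normalise the core fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    doc_type = data.get("document_type")
    # Assumption: anything outside the known set falls back to "unknown".
    if not isinstance(doc_type, str) or doc_type not in DOCUMENT_TYPES:
        doc_type = "unknown"
    return {
        "title": data.get("title"),
        "document_type": doc_type,
        "tags": _string_list(data.get("tags")),
        "suggested_categories": _string_list(data.get("suggested_categories")),
        # Everything else (vendor, due_date, ...) is treated as a domain field.
        "domain_fields": {k: v for k, v in data.items() if k not in CORE_KEYS},
    }
```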

Config (runtime, persisted to shared volume)

/config/doc_service_config.json:

{ "documents": { "max_pdf_bytes": 20971520 } }

Env override: DOC_MAX_PDF_MB
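
One way the limit could be resolved is sketched below. It assumes the environment variable takes precedence over the file, that `DOC_MAX_PDF_MB` is expressed in MiB (consistent with 20971520 = 20 × 1024 × 1024), and that 20 MiB is the fallback when neither is set. The function name and the fallback are hypothetical.

```python
import json
import os
from typing import Mapping, Optional

# Assumption: fall back to the value shown in the config example above.
DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520


def resolve_max_pdf_bytes(
    config_text: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Return the upload size limit in bytes."""
    env_mb = env.get("DOC_MAX_PDF_MB")
    if env_mb:
        # Assumption: the env override wins and is given in MiB.
        return int(float(env_mb) * 1024 * 1024)
    if config_text:
        cfg = json.loads(config_text)
        return int(cfg.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
    return DEFAULT_MAX_PDF_BYTES
```

Here `config_text` is the content of /config/doc_service_config.json, read by the caller, so the sketch does not touch the filesystem.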

Database migrations

| Revision | Description |
|---|---|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |

Run automatically on container start via alembic upgrade head.

Architecture

backend (proxy)  →  doc-service:8001
                        │
                   documents.py router
                        │
               ┌────────┴────────┐
          upload              list/get/patch
               │
        save_upload()        pdfplumber extraction
               │                    │
         Document(status=pending)   ai_client.classify_document()
               │                    │
        BackgroundTask         ai-service:8010/chat
               │                    │
         process_document()   JSON result → update doc row

Known limitations / not implemented

Re-process — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
Advanced field-level search — search param matches text fields via ILIKE but does not query into extracted_data JSON (e.g. filter by vendor or due_date)
Bulk operations — no bulk category assign/remove, no bulk delete
Document sharing — documents are strictly per-user; no group sharing yet
Pagination in categories — categories are returned as a full list (no pagination)
File type — only PDF supported
Concurrent uploads — no rate limiting per user

Future work

POST /documents/{id}/reprocess — re-run AI extraction
Advanced filter: query extracted_data JSON fields (vendor, due_date, amount) — requires a PostgreSQL jsonb column or indexed generated columns
Bulk operations endpoint
Document sharing via groups (blocked on groups/permissions system in backend)
Support additional file types (images via OCR, DOCX)
Rate limiting on upload endpoint
Soft delete with restore
Category rename / delete with cascade handling
