Business-Management/features/doc-service/STATUS.md
curo1305 c4f0c7ad49 Add priority queue to ai-service and STATUS.md workflow
- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:58:10 +02:00


Doc Service — Status

What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.

Port: 8001 (internal only, not exposed to host). All traffic arrives via the backend proxy (backend/app/routers/documents_proxy.py), which injects the authenticated x-user-id header.
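Because the service trusts the proxy-injected header, a handler would typically refuse any request that arrives without it. The following is a minimal sketch of such a check, not the service's actual code; `require_user_id` and the use of `PermissionError` are hypothetical choices for illustration.

```python
def require_user_id(headers: dict[str, str]) -> str:
    """Return the caller's user id from the x-user-id header.

    Hypothetical helper: header names are matched case-insensitively,
    and a missing or blank value is rejected.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    user_id = lowered.get("x-user-id", "").strip()
    if not user_id:
        raise PermissionError("missing x-user-id header")
    return user_id
```

In a web framework this would normally live in a request dependency or middleware, so every route receives a validated user id.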

Database: shared PostgreSQL instance; migration history is kept separate via a dedicated Alembic version table, alembic_version_doc_service. Storage: /data/documents/ (Docker named volume doc_data).
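Files are stored per user under the storage root, using the layout /data/documents/{user_id}/{doc_id}.pdf. A rough sketch of building that path is shown here; the function name `document_path` and the traversal guard are assumptions added for illustration, not something the service is confirmed to do.

```python
from pathlib import Path
from uuid import UUID

STORAGE_ROOT = Path("/data/documents")


def document_path(user_id: str, doc_id: UUID, root: Path = STORAGE_ROOT) -> Path:
    """Build the on-disk location for a document.

    Assumption: user_id comes from a header, so we reject values
    (such as "../other") that would resolve outside the storage root.
    """
    path = root / user_id / f"{doc_id}.pdf"
    if root.resolve() not in path.resolve().parents:
        raise ValueError(f"user_id escapes storage root: {user_id!r}")
    return path
```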


Current functionality

Document lifecycle

  1. POST /documents/upload — validate PDF, persist file to /data/documents/{user_id}/{doc_id}.pdf, create DB row with status=pending, enqueue background extraction
  2. Background task: extract text with pdfplumber → POST to ai-service /chat → parse JSON result → update status=done (or failed)
  3. AI extracts: title, document_type, tags, suggested_categories, plus domain fields (vendor, customer, dates, amounts, etc.) into extracted_data (JSON string)
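The background step can be pictured as a small state machine over the status field. The sketch below is illustrative only: `Doc` is a hypothetical in-memory stand-in for the DB row, the text extraction and AI call are passed in as plain callables, and the point at which the status moves to processing is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Doc:
    # Hypothetical in-memory stand-in for the documents DB row.
    status: str = "pending"
    extracted_data: Optional[str] = None
    error_message: Optional[str] = None


def process_document(
    doc: Doc,
    extract_text: Callable[[], str],
    classify: Callable[[str], str],
) -> Doc:
    """Run extraction for one document and record the outcome on the row."""
    doc.status = "processing"  # assumption: set when the task starts
    try:
        text = extract_text()      # pdfplumber in the real service
        raw = classify(text)       # POST to ai-service /chat in the real service
        result = json.loads(raw)   # the AI reply is expected to be JSON
        doc.extracted_data = json.dumps(result)
        doc.status = "done"
    except Exception as exc:       # any failure marks the document as failed
        doc.status = "failed"
        doc.error_message = str(exc)
    return doc
```

Injecting the two callables keeps the flow testable without a PDF file or a running ai-service.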

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /documents/upload | Upload PDF; returns 202 with initial doc row |
| GET | /documents | Paginated list with filters and sort |
| GET | /documents/{id} | Single document |
| GET | /documents/{id}/status | Lightweight status poll |
| GET | /documents/{id}/download | Stream file bytes |
| DELETE | /documents/{id} | Delete document and file |
| PATCH | /documents/{id}/type | Update document type |
| PATCH | /documents/{id}/tags | Replace tag list (dedup, preserve order) |
| PATCH | /documents/{id}/title | Update editable title |
| GET | /documents/categories | List all categories for the user |
| POST | /documents/categories | Create a category |
| POST | /documents/{id}/categories/{cat_id} | Assign category to document |
| DELETE | /documents/{id}/categories/{cat_id} | Remove category from document |
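PATCH /documents/{id}/tags is described as replacing the tag list with duplicates removed and order preserved. One common way to do that in Python is sketched below; `normalise_tags` is a hypothetical name, and whether the service also trims whitespace or folds case is not stated, so the sketch compares tags exactly.

```python
def normalise_tags(tags: list[str]) -> list[str]:
    """Drop duplicate tags while keeping the first occurrence of each.

    dict preserves insertion order, so dict.fromkeys() deduplicates
    without reordering.
    """
    return list(dict.fromkeys(tags))
```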

Pagination & filtering (GET /documents)

Query params:

| Param | Default | Notes |
|-------|---------|-------|
| page | 1 | ≥ 1 |
| per_page | 20 | 1–100 |
| sort | created_at | created_at, processed_at, filename, title, file_size, status, document_type |
| order | desc | asc or desc |
| status | | filter by status string |
| document_type | | filter by document type |
| search | | case-insensitive ILIKE on title, filename, tags, document_type |

Response: { items: [...], total: N, page: N, pages: N }
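The pages value in the response follows from total and per_page. A small sketch of the arithmetic, assuming per_page is limited to 1 to 100 and that an empty result reports zero pages (the service may report 1 instead); `paginate` and the offset/limit keys are hypothetical.

```python
import math


def paginate(total: int, page: int = 1, per_page: int = 20) -> dict:
    """Compute page count and the row window for one page of results."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    pages = math.ceil(total / per_page)  # 0 when there are no rows
    offset = (page - 1) * per_page       # rows to skip before this page
    return {
        "total": total,
        "page": page,
        "pages": pages,
        "offset": offset,
        "limit": per_page,
    }
```

For example, 45 matching documents at the default page size span three pages, and page 3 starts at offset 40.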

Document schema

id            UUID
user_id       string (from x-user-id header)
filename      original filename
title         AI-suggested editable title (nullable)
file_size     bytes
status        pending | processing | done | failed
document_type AI-classified type (nullable)
extracted_data JSON string — all AI-extracted fields
tags          JSON array string — editable tags
error_message set if status=failed
created_at    upload timestamp
processed_at  when extraction finished
categories    many-to-many via category_assignments
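Since extracted_data and tags are stored as JSON strings, code that reads a row has to decode them before use. The helper below is a sketch under that assumption; `decode_document` is a hypothetical name, and treating a null or empty value as an empty container is a choice made here, not documented behaviour.

```python
import json
from typing import Any


def decode_document(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the row with its JSON-string fields parsed."""
    decoded = dict(row)
    # tags is a JSON array string; fall back to an empty list when unset
    decoded["tags"] = json.loads(row["tags"]) if row.get("tags") else []
    # extracted_data is a JSON object string; fall back to an empty dict
    decoded["extracted_data"] = (
        json.loads(row["extracted_data"]) if row.get("extracted_data") else {}
    )
    return decoded
```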

AI extraction (via ai-service)

The prompt sends the first 50,000 characters of extracted text. The expected JSON response includes:

  • title — suggested human-readable title
  • document_type — invoice / bill / receipt / order / expense / revenue / unknown
  • tags — list of keyword tags
  • suggested_categories — list of category names to suggest in the UI
  • Domain fields: vendor, customer, invoice_number, due_date, total_amount, currency, etc.
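A sketch of the two sides of this exchange: truncating the text before it goes into the prompt, and normalising the JSON that comes back. The function names are hypothetical, and falling back to unknown for an unrecognised document_type (rather than failing the document) is an assumption.

```python
import json

MAX_PROMPT_CHARS = 50_000
DOCUMENT_TYPES = {
    "invoice", "bill", "receipt", "order", "expense", "revenue", "unknown",
}


def truncate_for_prompt(text: str) -> str:
    """Keep only the first MAX_PROMPT_CHARS characters of the extracted text."""
    return text[:MAX_PROMPT_CHARS]


def parse_ai_result(raw: str) -> dict:
    """Parse the AI reply and fill in defaults for the list fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("document_type") not in DOCUMENT_TYPES:
        data["document_type"] = "unknown"  # assumption: fall back, do not reject
    data.setdefault("tags", [])
    data.setdefault("suggested_categories", [])
    return data
```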

Config (runtime, persisted to shared volume)

/config/doc_service_config.json:

{ "documents": { "max_pdf_bytes": 20971520 } }

Env override: DOC_MAX_PDF_MB
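How the two sources combine is not spelled out here, so the following is only a sketch under stated assumptions: the environment variable takes precedence over the file, DOC_MAX_PDF_MB is interpreted as mebibytes (1 MB = 1024 × 1024 bytes), and the 20 MiB value from the example config is used when neither source is present. `max_pdf_bytes` is a hypothetical helper.

```python
import json
import os
from pathlib import Path

DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520, as in the example config


def max_pdf_bytes(
    config_path: str = "/config/doc_service_config.json",
    env=None,
) -> int:
    """Resolve the upload size limit in bytes."""
    env = os.environ if env is None else env
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        # Assumption: the env var wins and is expressed in MiB.
        return int(float(override) * 1024 * 1024)
    path = Path(config_path)
    if path.is_file():
        cfg = json.loads(path.read_text())
        return int(cfg.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
    return DEFAULT_MAX_PDF_BYTES
```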

Database migrations

| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add title column to documents |

Run automatically on container start via alembic upgrade head.


Architecture

backend (proxy)  →  doc-service:8001
                        │
                   documents.py router
                        │
               ┌────────┴────────┐
          upload              list/get/patch
               │
        save_upload()        pdfplumber extraction
               │                    │
         Document(status=pending)   ai_client.classify_document()
               │                    │
        BackgroundTask         ai-service:8010/chat
               │                    │
         process_document()   JSON result → update doc row

Known limitations / not implemented

  • Re-process — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
  • Advanced field-level search: the search param matches text fields via ILIKE but does not query into extracted_data JSON (e.g. filter by vendor or due_date)
  • Bulk operations — no bulk category assign/remove, no bulk delete
  • Document sharing — documents are strictly per-user; no group sharing yet
  • Pagination in categories — categories are returned as a full list (no pagination)
  • File type — only PDF supported
  • Concurrent uploads — no rate limiting per user

Future work

  • POST /documents/{id}/reprocess — re-run AI extraction
  • Advanced filter: query extracted_data JSON fields (vendor, due_date, amount) — requires PostgreSQL jsonb column or indexed virtual columns
  • Bulk operations endpoint
  • Document sharing via groups (blocked on groups/permissions system in backend)
  • Support additional file types (images via OCR, DOCX)
  • Rate limiting on upload endpoint
  • Soft delete with restore
  • Category rename / delete with cascade handling