Business-Management/features/doc-service/STATUS.md
curo1305 d2042153a7 Add re-analyse button and POST /documents/{id}/reprocess endpoint
Resets status to pending, clears error_message, and re-enqueues the
background AI extraction task. Button is disabled while the document
is already pending or processing; returns 409 in that case from the API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 17:00:17 +02:00

Doc Service — Status

What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.

Port: 8001 (internal only, not exposed to host). All traffic arrives via the backend proxy (backend/app/routers/documents_proxy.py), which injects the authenticated x-user-id header.

Database: shared PostgreSQL instance, isolated via alembic_version_doc_service Alembic version table. Storage: /data/documents/ (Docker named volume doc_data).


Current functionality

Document lifecycle

  1. POST /documents/upload — validate PDF, persist file to /data/documents/{user_id}/{doc_id}.pdf, create DB row with status=pending, enqueue background extraction
  2. Background task: extract text with pdfplumber → POST to ai-service /chat → parse JSON result → update status=done (or failed)
  3. AI extracts: title, document_type, tags, suggested_categories, plus domain fields (vendor, customer, dates, amounts, etc.) into extracted_data (JSON string)
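
The lifecycle above can be sketched as a single function that moves a row from pending to done or failed. This is a minimal illustration, not the service's actual code: the `Document` dataclass is a hypothetical in-memory stand-in for the DB row, `extract_text` and `classify` are injected callables standing in for the pdfplumber step and the ai-service call, and setting `processed_at` on failure as well as success is an assumption.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Document:
    """Hypothetical in-memory stand-in for the documents DB row."""
    id: str
    status: str = "pending"
    title: Optional[str] = None
    document_type: Optional[str] = None
    tags: str = "[]"            # JSON array string
    extracted_data: str = "{}"  # JSON string
    error_message: Optional[str] = None
    processed_at: Optional[datetime] = None


def process_document(
    doc: Document,
    extract_text: Callable[[], str],  # stand-in for the pdfplumber step
    classify: Callable[[str], str],   # stand-in for the ai-service /chat call
) -> Document:
    """Run extraction and move the row from pending to done or failed."""
    doc.status = "processing"
    try:
        text = extract_text()
        result = json.loads(classify(text))  # the AI reply is expected to be JSON
        doc.title = result.get("title")
        doc.document_type = result.get("document_type")
        doc.tags = json.dumps(result.get("tags", []))
        doc.extracted_data = json.dumps(result)
        doc.error_message = None
        doc.status = "done"
    except Exception as exc:  # any failure marks the document as failed
        doc.status = "failed"
        doc.error_message = str(exc)
    # Assumption: the timestamp is recorded for failures too.
    doc.processed_at = datetime.now(timezone.utc)
    return doc
```

Injecting the two callables keeps the state transitions testable without a PDF file or a running ai-service.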

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /documents/upload | Upload PDF; returns 202 with initial doc row |
| GET | /documents | Paginated list with filters, sort, and optional category_id filter |
| GET | /documents/{id} | Single document |
| GET | /documents/{id}/status | Lightweight status poll |
| GET | /documents/{id}/download | Stream file bytes |
| DELETE | /documents/{id} | Delete document and file |
| PATCH | /documents/{id}/type | Update document type |
| PATCH | /documents/{id}/tags | Replace tag list (dedup, preserve order) |
| PATCH | /documents/{id}/title | Update editable title |
| GET | /documents/categories | List all categories for the user |
| POST | /documents/categories | Create a category; triggers re-analysis of documents in similar categories |
| POST | /documents/{id}/reprocess | Reset status to pending and re-run AI extraction; 409 if already pending/processing |
| PATCH | /documents/categories/{id} | Rename a category |
| DELETE | /documents/categories/{id} | Delete a category |
| POST | /documents/{id}/categories/{cat_id} | Assign category to document |
| DELETE | /documents/{id}/categories/{cat_id} | Remove category from document |
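
The reprocess rule (reset to pending, clear the error, re-enqueue, and refuse with 409 while work is already in flight) could look roughly like the sketch below. The `Doc` dataclass, the `Conflict` exception standing in for an HTTP 409, and the `enqueue` callback are all hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Statuses during which a document must not be re-enqueued.
IN_FLIGHT = {"pending", "processing"}


class Conflict(Exception):
    """Stands in for an HTTP 409 response in this sketch."""


@dataclass
class Doc:
    status: str
    error_message: Optional[str] = None


def reprocess(doc: Doc, enqueue: Callable[[Doc], None]) -> Doc:
    """Reset a finished or failed document and schedule extraction again."""
    if doc.status in IN_FLIGHT:
        raise Conflict(f"document is already {doc.status}")
    doc.status = "pending"
    doc.error_message = None
    enqueue(doc)  # hand the document back to the background task runner
    return doc
```

Checking the status before mutating anything means a rejected request leaves the row untouched.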

Pagination & filtering (GET /documents)

Query params:

| Param | Default | Notes |
| --- | --- | --- |
| page | 1 | ≥ 1 |
| per_page | 20 | 1–100 |
| sort | created_at | created_at, processed_at, filename, title, file_size, status, document_type |
| order | desc | asc or desc |
| status | | filter by status string |
| document_type | | filter by document type |
| search | | case-insensitive ILIKE on title, filename, tags, document_type |
| category_id | | filter to documents assigned to this category UUID |

Response: { items: [...], total: N, page: N, pages: N }
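
To make the response envelope concrete, here is a small sketch of how `page` and `per_page` might map onto it. It slices an in-memory list rather than querying the database. The upper bound of 100 on `per_page` and the choice to report `pages` as 0 for an empty result are assumptions.

```python
import math
from typing import Any, Dict, Sequence

MAX_PER_PAGE = 100  # assumed upper bound for per_page


def paginate(rows: Sequence[Any], page: int = 1, per_page: int = 20) -> Dict[str, Any]:
    """Return the { items, total, page, pages } envelope for one page of rows."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= MAX_PER_PAGE:
        raise ValueError(f"per_page must be between 1 and {MAX_PER_PAGE}")
    total = len(rows)
    start = (page - 1) * per_page
    return {
        "items": list(rows[start:start + per_page]),
        "total": total,
        "page": page,
        "pages": math.ceil(total / per_page),  # 0 when there are no rows
    }
```

With 45 rows and the default page size, page 3 holds the last five rows and `pages` is 3.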

Document schema

id            UUID
user_id       string (from x-user-id header)
filename      original filename
title         AI-suggested editable title (nullable)
file_size     bytes
status        pending | processing | done | failed
document_type AI-classified type (nullable)
extracted_data JSON string — all AI-extracted fields
tags          JSON array string — editable tags
error_message set if status=failed
created_at    upload timestamp
processed_at  when extraction finished
categories    many-to-many via category_assignments
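
Because `tags` and `extracted_data` are stored as JSON strings, callers have to decode them before use and re-encode them on write. The helpers below are a hedged sketch of that round trip, including the dedup-while-preserving-order behaviour described for the tags endpoint. The function names are hypothetical, and treating a null column as an empty list or object is an assumption.

```python
import json
from typing import Any, Dict, List


def normalize_tags(tags: List[str]) -> List[str]:
    """Drop duplicate tags while keeping first-seen order."""
    return list(dict.fromkeys(tags))


def encode_tags(tags: List[str]) -> str:
    """Serialise a tag list into the JSON array string stored in the row."""
    return json.dumps(normalize_tags(tags))


def decode_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with its JSON string columns parsed."""
    out = dict(row)
    # Assumption: a null or empty column decodes to an empty container.
    out["tags"] = json.loads(row.get("tags") or "[]")
    out["extracted_data"] = json.loads(row.get("extracted_data") or "{}")
    return out
```

`dict.fromkeys` is used for deduplication because dictionaries preserve insertion order, so the first occurrence of each tag keeps its position.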

AI extraction (via ai-service)

System prompt and user prompt template are loaded at runtime from doc_service_config.json (system_prompts key). Defaults are built into the service and used as fallback if the config key is absent. Changes made via the AI Settings UI take effect within 30 seconds (config cache TTL).
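
The 30-second cache with built-in fallback could be implemented along these lines. This is only a sketch: the class name, the injected `clock`, and the placeholder contents of `DEFAULT_PROMPTS` are hypothetical, and the real key layout under `system_prompts` is not specified here.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable, Dict, Optional

# Placeholder defaults; the service's real built-in prompts are not shown here.
DEFAULT_PROMPTS: Dict[str, Any] = {"system": "<built-in system prompt>"}


class CachedConfig:
    """Re-read a JSON config file at most once per TTL window."""

    def __init__(self, path: str, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._path = Path(path)
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaded_at: Optional[float] = None
        self._data: Dict[str, Any] = {}

    def get(self) -> Dict[str, Any]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            try:
                self._data = json.loads(self._path.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                self._data = {}  # missing or unreadable file: use defaults
            self._loaded_at = now
        return self._data

    def system_prompts(self) -> Dict[str, Any]:
        # Fall back to the built-in defaults when the key is absent.
        return self.get().get("system_prompts", DEFAULT_PROMPTS)
```

A monotonic clock is used so that wall-clock adjustments cannot extend or shorten the cache window.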

Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

  • title — suggested human-readable title
  • document_type — invoice / bill / receipt / order / expense / revenue / unknown
  • tags — list of keyword tags
  • suggested_categories — list of category names to suggest in the UI
  • Domain fields: vendor, customer, invoice_number, due_date, total_amount, currency, etc.
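
As an illustration of the contract above, the sketch below truncates the prompt text to 50 000 characters and normalises the AI reply. The function names are hypothetical, and coercing an unrecognised `document_type` to `unknown` is an assumption rather than documented behaviour.

```python
import json
from typing import Any, Dict

MAX_PROMPT_CHARS = 50_000
DOCUMENT_TYPES = {
    "invoice", "bill", "receipt", "order", "expense", "revenue", "unknown",
}


def build_prompt_text(text: str) -> str:
    """Keep only the first 50 000 characters of the extracted text."""
    return text[:MAX_PROMPT_CHARS]


def parse_ai_response(raw: str) -> Dict[str, Any]:
    """Parse the AI reply and pull out the fields stored on the document."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object from the AI service")
    doc_type = data.get("document_type")
    if doc_type not in DOCUMENT_TYPES:
        doc_type = "unknown"  # assumption: unrecognised types are coerced
    return {
        "title": data.get("title"),
        "document_type": doc_type,
        "tags": [str(t) for t in data.get("tags") or []],
        "suggested_categories": [
            str(c) for c in data.get("suggested_categories") or []
        ],
        # The full reply, domain fields included, is kept as a JSON string.
        "extracted_data": json.dumps(data),
    }
```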

Config (runtime, persisted to shared volume)

/config/doc_service_config.json:

{ "documents": { "max_pdf_bytes": 20971520 } }

Env override: DOC_MAX_PDF_MB
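
One plausible way to resolve the effective size limit is sketched below. It assumes that the environment variable takes precedence over the config file, that `DOC_MAX_PDF_MB` is measured in MiB (so 20 corresponds to the 20971520 bytes in the sample config), and that the same value is the built-in default. The function name is hypothetical.

```python
import os
from typing import Any, Dict, Mapping, Optional

# 20 MiB, which equals the 20971520 shown in the sample config.
DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024


def resolve_max_pdf_bytes(
    config: Dict[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Return the upload size limit in bytes, honouring DOC_MAX_PDF_MB."""
    env = os.environ if env is None else env
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        # Assumption: the override is expressed in MiB and wins over the file.
        return int(float(override) * 1024 * 1024)
    documents = config.get("documents", {})
    return int(documents.get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
```

Passing `env` explicitly keeps the function easy to test without mutating the process environment.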

Database migrations

| Revision | Description |
| --- | --- |
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add title column to documents |

Run automatically on container start via alembic upgrade head.


Architecture

backend (proxy)  →  doc-service:8001
                        │
                   documents.py router
                        │
               ┌────────┴────────┐
          upload              list/get/patch
               │
        save_upload()        pdfplumber extraction
               │                    │
         Document(status=pending)   ai_client.classify_document()
               │                    │
        BackgroundTask         ai-service:8010/chat
               │                    │
         process_document()   JSON result → update doc row

Known limitations / not implemented

  • Re-process — resolved: POST /documents/{id}/reprocess now re-triggers AI extraction on an existing document (e.g. after changing the AI model or prompt)
  • Advanced field-level search — the search param matches text fields via ILIKE but does not query into extracted_data JSON (e.g. filter by vendor or due_date)
  • Bulk operations — no bulk category assign/remove, no bulk delete
  • Document sharing — documents are strictly per-user; no group sharing yet
  • Pagination in categories — categories are returned as a full list (no pagination)
  • File type — only PDF supported
  • Concurrent uploads — no rate limiting per user

Future work

  • POST /documents/{id}/reprocess — re-run AI extraction (done; listed under Endpoints)
  • Advanced filter: query extracted_data JSON fields (vendor, due_date, amount) — requires PostgreSQL jsonb column or indexed virtual columns
  • Bulk operations endpoint
  • Document sharing via groups (blocked on groups/permissions system in backend)
  • Support additional file types (images via OCR, DOCX)
  • Rate limiting on upload endpoint
  • Soft delete with restore
  • Category rename / delete with cascade handling