Add priority queue to ai-service and STATUS.md workflow

- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:58:10 +02:00
parent d2495190a9
commit c4f0c7ad49
18 changed files with 1253 additions and 35 deletions
@@ -0,0 +1,143 @@
+# Doc Service — Status
+
+## What it is
+
+PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.
+
+Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.
+
+Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table. Storage: `/data/documents/` (Docker named volume `doc_data`).
+
+---
+
+## Current functionality
+
+### Document lifecycle
+
+1. `POST /documents/upload` — validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
+2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
+3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)
+
+### Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
+| `GET` | `/documents` | Paginated list with filters and sort |
+| `GET` | `/documents/{id}` | Single document |
+| `GET` | `/documents/{id}/status` | Lightweight status poll |
+| `GET` | `/documents/{id}/download` | Stream file bytes |
+| `DELETE` | `/documents/{id}` | Delete document and file |
+| `PATCH` | `/documents/{id}/type` | Update document type |
+| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
+| `PATCH` | `/documents/{id}/title` | Update editable title |
+| `GET` | `/documents/categories` | List all categories for the user |
+| `POST` | `/documents/categories` | Create a category |
+| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
+| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
+
+### Pagination & filtering (`GET /documents`)
+
+Query params:
+
+| Param | Default | Notes |
+|-------|---------|-------|
+| `page` | 1 | ≥ 1 |
+| `per_page` | 20 | 1–100 |
+| `sort` | `created_at` | One of `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type`; any other value falls back to `created_at` |
+| `order` | `desc` | `asc` \| `desc` |
+| `status` | — | filter by status string |
+| `document_type` | — | filter by document type |
+| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |
+
+Response: `{ items: [...], total: N, page: N, pages: N }`, where `pages` is at least 1 even when `total` is 0.
+
+### Document schema
+
+```
+id            UUID
+user_id       string (from x-user-id header)
+filename      original filename
+title         AI-suggested editable title (nullable)
+file_size     bytes
+status        pending | processing | done | failed
+document_type AI-classified type (nullable)
+extracted_data JSON string — all AI-extracted fields
+tags          JSON array string — editable tags
+error_message set if status=failed
+created_at    upload timestamp
+processed_at  when extraction finished
+categories    many-to-many via category_assignments
+```
+
+### AI extraction (via ai-service)
+
+The prompt sends the first 50 000 characters of the extracted text. The expected JSON response includes:
+- `title` — suggested human-readable title
+- `document_type` — invoice / bill / receipt / order / expense / revenue / unknown
+- `tags` — list of keyword tags
+- `suggested_categories` — list of category names to suggest in the UI
+- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.
+
+### Config (runtime, persisted to shared volume)
+
+`/config/doc_service_config.json`:
+```json
+{ "documents": { "max_pdf_bytes": 20971520 } }
+```
+Env override: `DOC_MAX_PDF_MB` (the JSON default above, 20971520 bytes, equals 20 MiB)
+
+### Database migrations
+
+| Revision | Description |
+|----------|-------------|
+| 0001 | Initial schema (documents, categories, category_assignments) |
+| 0002 | Add `title` column to documents |
+
+Run automatically on container start via `alembic upgrade head`.
+
+---
+
+## Architecture
+
+```
+backend (proxy)  →  doc-service:8001
+                        │
+                   documents.py router
+                        │
+               ┌────────┴────────┐
+          upload              list/get/patch
+               │
+        save_upload()        pdfplumber extraction
+               │                    │
+         Document(status=pending)   ai_client.classify_document()
+               │                    │
+        BackgroundTask         ai-service:8010/chat
+               │                    │
+         process_document()   JSON result → update doc row
+```
+
+---
+
+## Known limitations / not implemented
+
+- **Re-process** — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
+- **Advanced field-level search** — `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
+- **Bulk operations** — no bulk category assign/remove, no bulk delete
+- **Document sharing** — documents are strictly per-user; no group sharing yet
+- **Pagination in categories** — categories are returned as a full list (no pagination)
+- **File type** — only PDF supported
+- **Concurrent uploads** — no rate limiting per user
+
+---
+
+## Future work
+
+- [ ] `POST /documents/{id}/reprocess` — re-run AI extraction
+- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount) — requires PostgreSQL `jsonb` column or indexed virtual columns
+- [ ] Bulk operations endpoint
+- [ ] Document sharing via groups (blocked on groups/permissions system in backend)
+- [ ] Support additional file types (images via OCR, DOCX)
+- [ ] Rate limiting on upload endpoint
+- [ ] Soft delete with restore
+- [ ] Category rename / delete with cascade handling
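Reviewer note: the document lifecycle in the STATUS file above (upload returns 202 with `status=pending`, then a client polls `GET /documents/{id}/status` until `done` or `failed`) can be sketched from the caller's side as follows. This is only an illustration under assumptions: `wait_for_extraction`, `fetch_status`, the polling interval, and the poll limit are hypothetical names and values, not part of this commit, and the HTTP call itself is left to the caller.

```python
import time
from typing import Callable

# Terminal states taken from the STATUS file: pending | processing | done | failed.
TERMINAL_STATUSES = {"done", "failed"}


def wait_for_extraction(
    fetch_status: Callable[[], str],
    *,
    interval: float = 1.0,   # assumed delay between polls, in seconds
    max_polls: int = 60,     # assumed upper bound on the number of polls
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the document reaches a terminal status and return it.

    `fetch_status` is a hypothetical callable that wraps
    GET /documents/{id}/status and returns the `status` string.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)
    raise TimeoutError(f"document still not processed after {max_polls} polls")
```

Injecting `fetch_status` and `sleep` keeps the helper independent of any particular HTTP client and makes it easy to exercise without a running service.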
@@ -1,13 +1,14 @@
 import asyncio
 import json
+import math
 import uuid
 from datetime import datetime, timezone

 import aiofiles
 import pdfplumber
-from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, UploadFile
+from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, Query, UploadFile
 from fastapi.responses import StreamingResponse
-from sqlalchemy import select
+from sqlalchemy import func, or_, select
 from sqlalchemy.ext.asyncio import AsyncSession
 from sqlalchemy.orm import selectinload

@@ -16,7 +17,7 @@ from app.deps import get_user_id
 from app.models.category import DocumentCategory
 from app.models.category_assignment import CategoryAssignment
 from app.models.document import Document
-from app.schemas.document import DocumentOut, DocumentStatusOut, DocumentTypeUpdate, TagsUpdate, TitleUpdate
+from app.schemas.document import DocumentOut, DocumentPage, DocumentStatusOut, DocumentTypeUpdate, TagsUpdate, TitleUpdate
 from app.services.ai_client import AIServiceError, classify_document
 from app.services.config_reader import load_doc_config
 from app.services.storage import delete_file, get_upload_path, save_upload
@@ -50,6 +51,7 @@ def _doc_with_categories(doc: Document) -> DocumentOut:
        id=doc.id,
        user_id=doc.user_id,
        filename=doc.filename,
+        title=doc.title,
        file_size=doc.file_size,
        status=doc.status,
        document_type=doc.document_type,
@@ -143,28 +145,83 @@ async def upload_document(
    )
    db.add(doc)
    await db.commit()
-    await db.refresh(doc)

    background_tasks.add_task(process_document, doc_id)

+    # Re-query with selectinload so category_assignments is eagerly loaded.
+    # A new doc has no categories yet, but we need the relationship populated
+    # to avoid MissingGreenlet in the async session.
+    doc = await _get_user_doc(doc_id, user_id, db)
    return _doc_with_categories(doc)


-@router.get("", response_model=list[DocumentOut])
+_SORT_COLUMNS = {
+    "created_at": Document.created_at,
+    "processed_at": Document.processed_at,
+    "filename": Document.filename,
+    "title": Document.title,
+    "file_size": Document.file_size,
+    "status": Document.status,
+    "document_type": Document.document_type,
+}
+
+
+@router.get("", response_model=DocumentPage)
 async def list_documents(
+    page: int = Query(default=1, ge=1),
+    per_page: int = Query(default=20, ge=1, le=100),
+    sort: str = Query(default="created_at"),
+    order: str = Query(default="desc", pattern="^(asc|desc)$"),
+    status: str | None = Query(default=None),
+    document_type: str | None = Query(default=None),
+    search: str | None = Query(default=None),
    user_id: str = Depends(get_user_id),
    db: AsyncSession = Depends(get_db),
-) -> list[DocumentOut]:
-    result = await db.execute(
+) -> DocumentPage:
+    sort_col = _SORT_COLUMNS.get(sort, Document.created_at)
+    sort_expr = sort_col.desc() if order == "desc" else sort_col.asc()
+
+    # Build filter conditions once and reuse for both count + items queries.
+    conditions = [Document.user_id == user_id]
+    if status:
+        conditions.append(Document.status == status)
+    if document_type:
+        conditions.append(Document.document_type == document_type)
+    if search:
+        like = f"%{search}%"
+        conditions.append(
+            or_(
+                Document.title.ilike(like),
+                Document.filename.ilike(like),
+                Document.tags.ilike(like),
+                Document.document_type.ilike(like),
+            )
+        )
+
+    count_result = await db.execute(
+        select(func.count(Document.id)).where(*conditions)
+    )
+    total = count_result.scalar_one()
+
+    items_result = await db.execute(
        select(Document)
-        .where(Document.user_id == user_id)
+        .where(*conditions)
        .options(
            selectinload(Document.category_assignments)
            .selectinload(CategoryAssignment.category)
        )
-        .order_by(Document.created_at.desc())
+        .order_by(sort_expr)
+        .offset((page - 1) * per_page)
+        .limit(per_page)
+    )
+    items = [_doc_with_categories(d) for d in items_result.scalars().all()]
+
+    return DocumentPage(
+        items=items,
+        total=total,
+        page=page,
+        pages=max(1, math.ceil(total / per_page)),
    )
-    return [_doc_with_categories(d) for d in result.scalars().all()]


@router.get("/{doc_id}", response_model=DocumentOut)
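Reviewer note: in `list_documents` above, `search` is interpolated directly into the pattern `f"%{search}%"`, so a `%` or `_` typed by the user acts as a wildcard. The diff does not say whether that is intended. If literal matching is wanted, one possible approach is to escape the term first; the helper below is a hypothetical sketch, not code from this commit.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE wildcards so `term` is matched literally.

    The escape character itself is doubled first, then `%` and `_`
    are each prefixed with it.
    """
    return (
        term.replace(escape, escape * 2)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


# Hypothetical use inside the handler (not executed here):
#   like = f"%{escape_like(search)}%"
#   Document.title.ilike(like, escape="\\")
pattern = f"%{escape_like('100%_done')}%"
```

SQLAlchemy's `ilike()` accepts an `escape` argument, so the same escape character would need to be passed to each of the four `ilike` calls for the escaping to take effect.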
@@ -27,6 +27,13 @@ class DocumentOut(BaseModel):
    model_config = {"from_attributes": True}


+class DocumentPage(BaseModel):
+    items: list[DocumentOut]
+    total: int
+    page: int
+    pages: int
+
+
 class DocumentStatusOut(BaseModel):
    id: str
    status: str
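Reviewer note: the `page`, `total`, and `pages` fields of `DocumentPage` follow from the offset/limit arithmetic in `list_documents`. The standalone sketch below mirrors that arithmetic so the edge cases are easy to check; `PageWindow` and `page_window` are hypothetical names used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageWindow:
    offset: int  # rows to skip, as passed to .offset()
    limit: int   # rows to fetch, as passed to .limit()
    pages: int   # total number of pages, never less than 1


def page_window(total: int, page: int, per_page: int) -> PageWindow:
    """Mirror the pagination arithmetic used by the list endpoint."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    return PageWindow(
        offset=(page - 1) * per_page,
        limit=per_page,
        # max(1, ...) reports one page even for an empty result set.
        pages=max(1, math.ceil(total / per_page)),
    )
```

Requesting a `page` greater than `pages` is not rejected by this arithmetic; the offset simply lands past the last row and `items` comes back empty.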