Add priority queue to ai-service and STATUS.md workflow

- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
curo1305
2026-04-14 22:58:10 +02:00
parent d2495190a9
commit c4f0c7ad49
18 changed files with 1253 additions and 35 deletions
+143
View File
@@ -0,0 +1,143 @@
# Doc Service — Status
## What it is
PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.
Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.
Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table. Storage: `/data/documents/` (Docker named volume `doc_data`).
---
## Current functionality
### Document lifecycle
1. `POST /documents/upload` — validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)
### Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters and sort |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
### Pagination & filtering (`GET /documents`)
Query params:
| Param | Default | Notes |
|-------|---------|-------|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | — | filter by status string |
| `document_type` | — | filter by document type |
| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |
Response: `{ items: [...], total: N, page: N, pages: N }`
### Document schema
```
id UUID
user_id string (from x-user-id header)
filename original filename
title AI-suggested editable title (nullable)
file_size bytes
status pending | processing | done | failed
document_type AI-classified type (nullable)
extracted_data JSON string — all AI-extracted fields
tags JSON array string — editable tags
error_message set if status=failed
created_at upload timestamp
processed_at when extraction finished
categories many-to-many via category_assignments
```
### AI extraction (via ai-service)
Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:
- `title` — suggested human-readable title
- `document_type` — invoice / bill / receipt / order / expense / revenue / unknown
- `tags` — list of keyword tags
- `suggested_categories` — list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.
### Config (runtime, persisted to shared volume)
`/config/doc_service_config.json`:
```json
{ "documents": { "max_pdf_bytes": 20971520 } }
```
Env override: `DOC_MAX_PDF_MB`
### Database migrations
| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |
Run automatically on container start via `alembic upgrade head`.
---
## Architecture
```
backend (proxy) → doc-service:8001
documents.py router
┌────────┴────────┐
upload list/get/patch
save_upload() pdfplumber extraction
│ │
Document(status=pending) ai_client.classify_document()
│ │
BackgroundTask ai-service:8010/chat
│ │
process_document() JSON result → update doc row
```
---
## Known limitations / not implemented
- **Re-process** — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
- **Advanced field-level search** — `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- **Bulk operations** — no bulk category assign/remove, no bulk delete
- **Document sharing** — documents are strictly per-user; no group sharing yet
- **Pagination in categories** — categories are returned as a full list (no pagination)
- **File type** — only PDF supported
- **Concurrent uploads** — no rate limiting per user
---
## Future work
- [ ] `POST /documents/{id}/reprocess` — re-run AI extraction
- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount) — requires PostgreSQL `jsonb` column or indexed virtual columns
- [ ] Bulk operations endpoint
- [ ] Document sharing via groups (blocked on groups/permissions system in backend)
- [ ] Support additional file types (images via OCR, DOCX)
- [ ] Rate limiting on upload endpoint
- [ ] Soft delete with restore
- [ ] Category rename / delete with cascade handling
+67 -10
View File
@@ -1,13 +1,14 @@
import asyncio
import json
import math
import uuid
from datetime import datetime, timezone
import aiofiles
import pdfplumber
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, UploadFile
from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, Query, UploadFile
from fastapi.responses import StreamingResponse
from sqlalchemy import select
from sqlalchemy import func, or_, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload
@@ -16,7 +17,7 @@ from app.deps import get_user_id
from app.models.category import DocumentCategory
from app.models.category_assignment import CategoryAssignment
from app.models.document import Document
from app.schemas.document import DocumentOut, DocumentStatusOut, DocumentTypeUpdate, TagsUpdate, TitleUpdate
from app.schemas.document import DocumentOut, DocumentPage, DocumentStatusOut, DocumentTypeUpdate, TagsUpdate, TitleUpdate
from app.services.ai_client import AIServiceError, classify_document
from app.services.config_reader import load_doc_config
from app.services.storage import delete_file, get_upload_path, save_upload
@@ -50,6 +51,7 @@ def _doc_with_categories(doc: Document) -> DocumentOut:
id=doc.id,
user_id=doc.user_id,
filename=doc.filename,
title=doc.title,
file_size=doc.file_size,
status=doc.status,
document_type=doc.document_type,
@@ -143,28 +145,83 @@ async def upload_document(
)
db.add(doc)
await db.commit()
await db.refresh(doc)
background_tasks.add_task(process_document, doc_id)
# Re-query with selectinload so category_assignments is eagerly loaded.
# A new doc has no categories yet, but we need the relationship populated
# to avoid MissingGreenlet in the async session.
doc = await _get_user_doc(doc_id, user_id, db)
return _doc_with_categories(doc)
@router.get("", response_model=list[DocumentOut])
_SORT_COLUMNS = {
"created_at": Document.created_at,
"processed_at": Document.processed_at,
"filename": Document.filename,
"title": Document.title,
"file_size": Document.file_size,
"status": Document.status,
"document_type": Document.document_type,
}
@router.get("", response_model=DocumentPage)
async def list_documents(
page: int = Query(default=1, ge=1),
per_page: int = Query(default=20, ge=1, le=100),
sort: str = Query(default="created_at"),
order: str = Query(default="desc", pattern="^(asc|desc)$"),
status: str | None = Query(default=None),
document_type: str | None = Query(default=None),
search: str | None = Query(default=None),
user_id: str = Depends(get_user_id),
db: AsyncSession = Depends(get_db),
) -> list[DocumentOut]:
result = await db.execute(
) -> DocumentPage:
sort_col = _SORT_COLUMNS.get(sort, Document.created_at)
sort_expr = sort_col.desc() if order == "desc" else sort_col.asc()
# Build filter conditions once and reuse for both count + items queries.
conditions = [Document.user_id == user_id]
if status:
conditions.append(Document.status == status)
if document_type:
conditions.append(Document.document_type == document_type)
if search:
like = f"%{search}%"
conditions.append(
or_(
Document.title.ilike(like),
Document.filename.ilike(like),
Document.tags.ilike(like),
Document.document_type.ilike(like),
)
)
count_result = await db.execute(
select(func.count(Document.id)).where(*conditions)
)
total = count_result.scalar_one()
items_result = await db.execute(
select(Document)
.where(Document.user_id == user_id)
.where(*conditions)
.options(
selectinload(Document.category_assignments)
.selectinload(CategoryAssignment.category)
)
.order_by(Document.created_at.desc())
.order_by(sort_expr)
.offset((page - 1) * per_page)
.limit(per_page)
)
items = [_doc_with_categories(d) for d in items_result.scalars().all()]
return DocumentPage(
items=items,
total=total,
page=page,
pages=max(1, math.ceil(total / per_page)),
)
return [_doc_with_categories(d) for d in result.scalars().all()]
@router.get("/{doc_id}", response_model=DocumentOut)
@@ -27,6 +27,13 @@ class DocumentOut(BaseModel):
model_config = {"from_attributes": True}
class DocumentPage(BaseModel):
items: list[DocumentOut]
total: int
page: int
pages: int
class DocumentStatusOut(BaseModel):
id: str
status: str