Add priority queue to ai-service and STATUS.md workflow

- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,143 @@
# Doc Service — Status

## What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.

Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.

Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table. Storage: `/data/documents/` (Docker named volume `doc_data`).

---

## Current functionality

### Document lifecycle

1. `POST /documents/upload` — validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)

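The status transitions in the lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `DocumentRow` is a hypothetical in-memory stand-in for the DB row, `extract_text` stands in for the `pdfplumber` step, and `classify` stands in for the POST to ai-service `/chat`. Setting `status=processing` at the start of the task is an assumption based on the status values listed in the schema.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DocumentRow:
    """Hypothetical in-memory stand-in for a row in the documents table."""
    status: str = "pending"
    title: Optional[str] = None
    document_type: Optional[str] = None
    extracted_data: Optional[str] = None  # JSON string, as in the schema
    error_message: Optional[str] = None


def process_document(
    doc: DocumentRow,
    extract_text: Callable[[], str],
    classify: Callable[[str], str],
) -> DocumentRow:
    """Run extraction for one document and record the outcome on the row.

    Both callables are injected so the sketch runs without a PDF or a
    network call; they are assumptions, not the service's real helpers.
    """
    doc.status = "processing"  # assumed intermediate state
    try:
        text = extract_text()
        result = json.loads(classify(text))  # the AI reply must be valid JSON
        doc.title = result.get("title")
        doc.document_type = result.get("document_type")
        doc.extracted_data = json.dumps(result)
        doc.status = "done"
    except Exception as exc:  # any failure marks the row as failed
        doc.error_message = str(exc)
        doc.status = "failed"
    return doc
```

Injecting the two steps keeps the state machine testable on its own; the real background task presumably calls its extraction and AI client directly.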
### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters and sort |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |

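The "dedup, preserve order" behaviour noted for `PATCH /documents/{id}/tags` could look like the sketch below. The function name `dedupe_tags` is hypothetical, and comparing tags as exact strings (no trimming or case folding) is an assumption; the router may normalise tags differently.

```python
from typing import Iterable, List


def dedupe_tags(tags: Iterable[str]) -> List[str]:
    """Drop repeated tags while keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so fromkeys keeps the
    # first occurrence of each tag and discards later repeats.
    return list(dict.fromkeys(tags))
```

For example, `dedupe_tags(["invoice", "2024", "invoice"])` returns `["invoice", "2024"]`.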
### Pagination & filtering (`GET /documents`)

Query params:

| Param | Default | Notes |
|-------|---------|-------|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | — | filter by status string |
| `document_type` | — | filter by document type |
| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |

Response: `{ items: [...], total: N, page: N, pages: N }`

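As a rough illustration of how the response envelope relates to `page` and `per_page`, the sketch below paginates an in-memory sequence. The helper name `paginate` is hypothetical, the service presumably applies a limit and offset in SQL rather than slicing a list, and returning `pages: 0` for an empty result is an assumption.

```python
import math
from typing import Any, Dict, Sequence


def paginate(rows: Sequence[Any], page: int = 1, per_page: int = 20) -> Dict[str, Any]:
    """Build the {items, total, page, pages} envelope for one page of rows."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    total = len(rows)
    pages = math.ceil(total / per_page)  # 0 when there are no rows (assumption)
    start = (page - 1) * per_page
    return {
        "items": list(rows[start:start + per_page]),
        "total": total,
        "page": page,
        "pages": pages,
    }
```

With 45 matching rows and `per_page=20`, page 3 holds the last 5 items and `pages` is 3.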
### Document schema

```
id              UUID
user_id         string (from x-user-id header)
filename        original filename
title           AI-suggested editable title (nullable)
file_size       bytes
status          pending | processing | done | failed
document_type   AI-classified type (nullable)
extracted_data  JSON string — all AI-extracted fields
tags            JSON array string — editable tags
error_message   set if status=failed
created_at      upload timestamp
processed_at    when extraction finished
categories      many-to-many via category_assignments
```

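Because `extracted_data` and `tags` are stored as JSON strings, a consumer has to decode them before use. The sketch below shows one way to do that on a plain dict; `decode_document` is a hypothetical helper, and treating missing or empty values as `{}` and `[]` is an assumption about how unprocessed rows look.

```python
import json
from typing import Any, Dict, Optional


def decode_document(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a document row with its JSON-string fields parsed."""
    out = dict(row)
    raw_data: Optional[str] = row.get("extracted_data")
    raw_tags: Optional[str] = row.get("tags")
    # Rows that have not been processed yet may carry no JSON at all.
    out["extracted_data"] = json.loads(raw_data) if raw_data else {}
    out["tags"] = json.loads(raw_tags) if raw_tags else []
    return out
```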
### AI extraction (via ai-service)

Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

- `title` — suggested human-readable title
- `document_type` — invoice / bill / receipt / order / expense / revenue / unknown
- `tags` — list of keyword tags
- `suggested_categories` — list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.

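The truncation and response handling described above might be sketched as follows. The names `prepare_text` and `parse_extraction` are hypothetical, and falling back to `unknown` for an unrecognised `document_type`, as well as defaulting the two list fields to empty lists, are assumptions rather than documented behaviour.

```python
import json
from typing import Any, Dict

MAX_PROMPT_CHARS = 50_000
DOCUMENT_TYPES = {"invoice", "bill", "receipt", "order", "expense", "revenue", "unknown"}


def prepare_text(text: str) -> str:
    """Keep only the first 50 000 characters of the extracted text."""
    return text[:MAX_PROMPT_CHARS]


def parse_extraction(reply: str) -> Dict[str, Any]:
    """Parse the AI reply and normalise the fields the UI relies on."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    doc_type = data.get("document_type")
    # Assumption: anything outside the known set is treated as "unknown".
    if not isinstance(doc_type, str) or doc_type not in DOCUMENT_TYPES:
        data["document_type"] = "unknown"
    data.setdefault("tags", [])
    data.setdefault("suggested_categories", [])
    return data
```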
### Config (runtime, persisted to shared volume)

`/config/doc_service_config.json`:
```json
{ "documents": { "max_pdf_bytes": 20971520 } }
```
Env override: `DOC_MAX_PDF_MB`

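One plausible way to resolve the upload limit from these two sources is sketched below. The function `load_max_pdf_bytes` is hypothetical; the precedence (environment variable first, then the JSON file, then a built-in default) and the reading of `DOC_MAX_PDF_MB` as mebibytes are assumptions, chosen because 20971520 bytes equals 20 × 1024 × 1024.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional, Union

DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # 20971520, matching the JSON example


def load_max_pdf_bytes(
    config_path: Union[str, Path],
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Resolve the PDF size limit in bytes from env, config file, or default."""
    env = os.environ if env is None else env
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        # Assumption: the env value is in MiB and takes precedence.
        return int(float(override) * 1024 * 1024)
    path = Path(config_path)
    if path.exists():
        config = json.loads(path.read_text(encoding="utf-8"))
        return int(config.get("documents", {}).get("max_pdf_bytes", DEFAULT_MAX_PDF_BYTES))
    return DEFAULT_MAX_PDF_BYTES
```

Passing `env` explicitly keeps the function easy to test; in the service it would simply default to `os.environ`.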
### Database migrations

| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |

Run automatically on container start via `alembic upgrade head`.

---

## Architecture

```
backend (proxy) → doc-service:8001
                        │
               documents.py router
                        │
               ┌────────┴────────┐
            upload         list/get/patch
               │
         save_upload()            pdfplumber extraction
               │                            │
   Document(status=pending)   ai_client.classify_document()
               │                            │
        BackgroundTask            ai-service:8010/chat
               │                            │
      process_document()      JSON result → update doc row
```

---

## Known limitations / not implemented

- **Re-process** — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
- **Advanced field-level search** — `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- **Bulk operations** — no bulk category assign/remove, no bulk delete
- **Document sharing** — documents are strictly per-user; no group sharing yet
- **Pagination in categories** — categories are returned as a full list (no pagination)
- **File type** — only PDF supported
- **Concurrent uploads** — no rate limiting per user

---

## Future work

- [ ] `POST /documents/{id}/reprocess` — re-run AI extraction
- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount) — requires PostgreSQL `jsonb` column or indexed virtual columns
- [ ] Bulk operations endpoint
- [ ] Document sharing via groups (blocked on groups/permissions system in backend)
- [ ] Support additional file types (images via OCR, DOCX)
- [ ] Rate limiting on upload endpoint
- [ ] Soft delete with restore
- [ ] Category rename / delete with cascade handling