Files
Business-Management/features/doc-service/STATUS.md
T
curo1305 c4f0c7ad49 Add priority queue to ai-service and STATUS.md workflow
- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:58:10 +02:00

144 lines
5.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Doc Service — Status
## What it is
PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.
Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.
Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table. Storage: `/data/documents/` (Docker named volume `doc_data`).
---
## Current functionality
### Document lifecycle
1. `POST /documents/upload` — validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)
### Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters and sort |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |
### Pagination & filtering (`GET /documents`)
Query params:
| Param | Default | Notes |
|-------|---------|-------|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | — | filter by status string |
| `document_type` | — | filter by document type |
| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |
Response: `{ items: [...], total: N, page: N, pages: N }`
### Document schema
```
id UUID
user_id string (from x-user-id header)
filename original filename
title AI-suggested editable title (nullable)
file_size bytes
status pending | processing | done | failed
document_type AI-classified type (nullable)
extracted_data JSON string — all AI-extracted fields
tags JSON array string — editable tags
error_message set if status=failed
created_at upload timestamp
processed_at when extraction finished
categories many-to-many via category_assignments
```
### AI extraction (via ai-service)
Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:
- `title` — suggested human-readable title
- `document_type` — invoice / bill / receipt / order / expense / revenue / unknown
- `tags` — list of keyword tags
- `suggested_categories` — list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.
### Config (runtime, persisted to shared volume)
`/config/doc_service_config.json`:
```json
{ "documents": { "max_pdf_bytes": 20971520 } }
```
Env override: `DOC_MAX_PDF_MB`
### Database migrations
| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |
Run automatically on container start via `alembic upgrade head`.
---
## Architecture
```
backend (proxy) → doc-service:8001
documents.py router
┌────────┴────────┐
upload list/get/patch
save_upload() pdfplumber extraction
│ │
Document(status=pending) ai_client.classify_document()
│ │
BackgroundTask ai-service:8010/chat
│ │
process_document() JSON result → update doc row
```
---
## Known limitations / not implemented
- **Re-process** — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
- **Advanced field-level search** — `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- **Bulk operations** — no bulk category assign/remove, no bulk delete
- **Document sharing** — documents are strictly per-user; no group sharing yet
- **Pagination in categories** — categories are returned as a full list (no pagination)
- **File type** — only PDF supported
- **Concurrent uploads** — no rate limiting per user
---
## Future work
- [ ] `POST /documents/{id}/reprocess` — re-run AI extraction
- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount) — requires PostgreSQL `jsonb` column or indexed virtual columns
- [ ] Bulk operations endpoint
- [ ] Document sharing via groups (blocked on groups/permissions system in backend)
- [ ] Support additional file types (images via OCR, DOCX)
- [ ] Rate limiting on upload endpoint
- [ ] Soft delete with restore
- [ ] Category rename / delete with cascade handling