# Doc Service: Status

## What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.

Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.

Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table.

Storage: `/data/documents/` (Docker named volume `doc_data`).

---

## Current functionality

### Document lifecycle

1. `POST /documents/upload`: validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters, sort, and optional `category_id` filter |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category; triggers re-analysis of documents in similar categories |
| `PATCH` | `/documents/categories/{id}` | Rename a category |
| `DELETE` | `/documents/categories/{id}` | Delete a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |

### Pagination & filtering (`GET /documents`)

Query params:

| Param | Default | Notes |
|-------|---------|-------|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | - | filter by status string |
| `document_type` | - | filter by document type |
| `search` | - | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |
| `category_id` | - | filter to documents assigned to this category UUID |

Response: `{ items: [...], total: N, page: N, pages: N }`

### Document schema

```
id              UUID
user_id         string (from x-user-id header)
filename        original filename
title           AI-suggested editable title (nullable)
file_size       bytes
status          pending | processing | done | failed
document_type   AI-classified type (nullable)
extracted_data  JSON string - all AI-extracted fields
tags            JSON array string - editable tags
error_message   set if status=failed
created_at      upload timestamp
processed_at    when extraction finished
categories      many-to-many via category_assignments
```

### AI extraction (via ai-service)

System prompt and user prompt template are loaded at runtime from `doc_service_config.json` (`system_prompts` key). Defaults are built into the service and used as fallback if the config key is absent. Changes made via the AI Settings UI take effect within 30 seconds (config cache TTL).

Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

- `title`: suggested human-readable title
- `document_type`: invoice / bill / receipt / order / expense / revenue / unknown
- `tags`: list of keyword tags
- `suggested_categories`: list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.

### Config (runtime, persisted to shared volume)

`/config/doc_service_config.json`:

```json
{
  "documents": {
    "max_pdf_bytes": 20971520
  }
}
```

Env override: `DOC_MAX_PDF_MB`

### Database migrations

| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |

Run automatically on container start via `alembic upgrade head`.

---

## Architecture

```
backend (proxy) → doc-service:8001
                        │
                documents.py router
                        │
               ┌────────┴────────┐
            upload         list/get/patch
               │
        save_upload()              pdfplumber extraction
               │                            │
   Document(status=pending)    ai_client.classify_document()
               │                            │
        BackgroundTask              ai-service:8010/chat
               │                            │
      process_document()        JSON result → update doc row
```

---

## Known limitations / not implemented

- **Re-process**: no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
- **Advanced field-level search**: `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- **Bulk operations**: no bulk category assign/remove, no bulk delete
- **Document sharing**: documents are strictly per-user; no group sharing yet
- **Pagination in categories**: categories are returned as a full list (no pagination)
- **File type**: only PDF supported
- **Concurrent uploads**: no rate limiting per user

---

## Future work

- [ ] `POST /documents/{id}/reprocess`: re-run AI extraction
- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount); requires PostgreSQL `jsonb` column or indexed virtual columns
- [ ] Bulk operations endpoint
- [ ] Document sharing via groups (blocked on groups/permissions system in backend)
- [ ] Support additional file types (images via OCR, DOCX)
- [ ] Rate limiting on upload endpoint
- [ ] Soft delete with restore
- [ ] Category rename / delete with cascade handling
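
---

Two rules stated earlier, the tag replacement behaviour of `PATCH /documents/{id}/tags` (dedup, preserve order) and the `GET /documents` response envelope, can be sketched in isolation. This is a minimal illustration and not the service's actual code: `dedupe_tags` and `paginate` are hypothetical names, exact-match (case-sensitive) tag comparison is an assumption, and so is returning `pages = 0` for an empty result. A real implementation would most likely paginate in SQL rather than in memory.

```python
import math
from typing import Any, Dict, List


def dedupe_tags(tags: List[str]) -> List[str]:
    """Drop repeated tags while keeping first-seen order.

    Assumption: tags are compared exactly (case-sensitive, no trimming).
    """
    # dict keys keep insertion order, so this removes later duplicates only.
    return list(dict.fromkeys(tags))


def paginate(items: List[Any], page: int = 1, per_page: int = 20) -> Dict[str, Any]:
    """Build an envelope shaped like { items, total, page, pages }."""
    # Bounds taken from the query-param table: page >= 1, per_page in 1..100.
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")

    total = len(items)
    # Assumption: an empty result set reports zero pages.
    pages = math.ceil(total / per_page)
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "total": total,
        "page": page,
        "pages": pages,
    }
```

For example, `dedupe_tags(["invoice", "2024", "invoice"])` keeps the first `invoice` and drops the second, and `paginate(list(range(45)), page=3)` yields the last five items with `pages` equal to 3.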