# Doc Service — Status

## What it is

PDF document management microservice. Handles upload, storage, async AI-powered extraction, tagging, categorisation, and retrieval of PDF documents on a per-user basis.

Port: `8001` (internal only, not exposed to host). All traffic arrives via the backend proxy (`backend/app/routers/documents_proxy.py`), which injects the authenticated `x-user-id` header.

Database: shared PostgreSQL instance, isolated via `alembic_version_doc_service` Alembic version table.

Storage: `/data/documents/` (Docker named volume `doc_data`).

---

## Current functionality

### Document lifecycle

1. `POST /documents/upload` — validate PDF, persist file to `/data/documents/{user_id}/{doc_id}.pdf`, create DB row with `status=pending`, enqueue background extraction
2. Background task: extract text with `pdfplumber` → POST to ai-service `/chat` → parse JSON result → update `status=done` (or `failed`)
3. AI extracts: `title`, `document_type`, `tags`, `suggested_categories`, plus domain fields (vendor, customer, dates, amounts, etc.) into `extracted_data` (JSON string)

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/documents/upload` | Upload PDF; returns 202 with initial doc row |
| `GET` | `/documents` | Paginated list with filters and sort |
| `GET` | `/documents/{id}` | Single document |
| `GET` | `/documents/{id}/status` | Lightweight status poll |
| `GET` | `/documents/{id}/download` | Stream file bytes |
| `DELETE` | `/documents/{id}` | Delete document and file |
| `PATCH` | `/documents/{id}/type` | Update document type |
| `PATCH` | `/documents/{id}/tags` | Replace tag list (dedup, preserve order) |
| `PATCH` | `/documents/{id}/title` | Update editable title |
| `GET` | `/documents/categories` | List all categories for the user |
| `POST` | `/documents/categories` | Create a category |
| `POST` | `/documents/{id}/categories/{cat_id}` | Assign category to document |
| `DELETE` | `/documents/{id}/categories/{cat_id}` | Remove category from document |

### Pagination & filtering (`GET /documents`)

Query params:

| Param | Default | Notes |
|-------|---------|-------|
| `page` | 1 | ≥ 1 |
| `per_page` | 20 | 1–100 |
| `sort` | `created_at` | `created_at`, `processed_at`, `filename`, `title`, `file_size`, `status`, `document_type` |
| `order` | `desc` | `asc` \| `desc` |
| `status` | — | filter by status string |
| `document_type` | — | filter by document type |
| `search` | — | case-insensitive ILIKE on `title`, `filename`, `tags`, `document_type` |

Response: `{ items: [...], total: N, page: N, pages: N }`

### Document schema

```
id              UUID
user_id         string (from x-user-id header)
filename        original filename
title           AI-suggested editable title (nullable)
file_size       bytes
status          pending | processing | done | failed
document_type   AI-classified type (nullable)
extracted_data  JSON string — all AI-extracted fields
tags            JSON array string — editable tags
error_message   set if status=failed
created_at      upload timestamp
processed_at    when extraction finished
categories      many-to-many via category_assignments
```

### AI extraction (via ai-service)

Prompt sends the first 50 000 chars of extracted text. Expected JSON response includes:

- `title` — suggested human-readable title
- `document_type` — invoice / bill / receipt / order / expense / revenue / unknown
- `tags` — list of keyword tags
- `suggested_categories` — list of category names to suggest in the UI
- Domain fields: `vendor`, `customer`, `invoice_number`, `due_date`, `total_amount`, `currency`, etc.

### Config (runtime, persisted to shared volume)

`/config/doc_service_config.json`:

```json
{
  "documents": {
    "max_pdf_bytes": 20971520
  }
}
```

Env override: `DOC_MAX_PDF_MB`

### Database migrations

| Revision | Description |
|----------|-------------|
| 0001 | Initial schema (documents, categories, category_assignments) |
| 0002 | Add `title` column to documents |

Run automatically on container start via `alembic upgrade head`.

---

## Architecture

```
backend (proxy) → doc-service:8001
                        │
               documents.py router
                        │
               ┌────────┴────────┐
            upload         list/get/patch
               │
         save_upload()              pdfplumber extraction
               │                            │
   Document(status=pending)    ai_client.classify_document()
               │                            │
        BackgroundTask              ai-service:8010/chat
               │                            │
      process_document()       JSON result → update doc row
```

---

## Known limitations / not implemented

- **Re-process** — no endpoint to re-trigger AI extraction on an existing document (e.g. after changing the AI model or prompt)
- **Advanced field-level search** — `search` param matches text fields via ILIKE but does not query into `extracted_data` JSON (e.g. filter by `vendor` or `due_date`)
- **Bulk operations** — no bulk category assign/remove, no bulk delete
- **Document sharing** — documents are strictly per-user; no group sharing yet
- **Pagination in categories** — categories are returned as a full list (no pagination)
- **File type** — only PDF supported
- **Concurrent uploads** — no rate limiting per user

---

## Future work

- [ ] `POST /documents/{id}/reprocess` — re-run AI extraction
- [ ] Advanced filter: query `extracted_data` JSON fields (vendor, due_date, amount) — requires PostgreSQL `jsonb` column or indexed virtual columns
- [ ] Bulk operations endpoint
- [ ] Document sharing via groups (blocked on groups/permissions system in backend)
- [ ] Support additional file types (images via OCR, DOCX)
- [ ] Rate limiting on upload endpoint
- [ ] Soft delete with restore
- [ ] Category rename / delete with cascade handling
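
---

## Illustrative sketch: tag normalisation

The `PATCH /documents/{id}/tags` entry in the endpoint table says the tag list is replaced with duplicates removed and order preserved. The sketch below shows one way that could work, assuming tags arrive as a list of strings and are stored as a JSON array string, as in the document schema. `normalise_tags` is a hypothetical helper name, and the whitespace stripping is an added assumption, not documented behaviour.

```python
import json


def normalise_tags(tags: list[str]) -> str:
    """Return a JSON array string with duplicates removed, first occurrence kept."""
    # Assumption: surrounding whitespace is stripped and empty tags are dropped.
    cleaned = [t.strip() for t in tags if t.strip()]
    # dict.fromkeys keeps insertion order, so the first occurrence of each tag wins.
    return json.dumps(list(dict.fromkeys(cleaned)))
```

For example, `normalise_tags(["invoice", "acme", "invoice"])` yields a JSON array containing `invoice` and `acme`, in that order.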
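
## Illustrative sketch: list response envelope

The `GET /documents` section describes `page` (≥ 1), `per_page` (1–100) and a response of the form `{ items, total, page, pages }`. The sketch below shows how those fields could fit together. `list_response` is a hypothetical name, and rounding `pages` up (with 0 pages for an empty result) is an assumption, since the source does not say how `pages` is computed.

```python
import math


def list_response(items: list, total: int, page: int = 1, per_page: int = 20) -> dict:
    """Build the paginated envelope for a list of already-fetched items."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 100:
        raise ValueError("per_page must be between 1 and 100")
    # Assumption: pages is the ceiling of total / per_page, so an empty set gives 0.
    pages = math.ceil(total / per_page)
    return {"items": items, "total": total, "page": page, "pages": pages}
```

Under these assumptions, the row offset for a given page would be `(page - 1) * per_page`.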
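
## Illustrative sketch: resolving the upload size limit

The config section lists `max_pdf_bytes` in `/config/doc_service_config.json` and an env override, `DOC_MAX_PDF_MB`. The sketch below shows one possible resolution order. It assumes the env variable takes precedence when set, that one MB means 1024 × 1024 bytes (consistent with 20971520 being 20 × 1024 × 1024), and that a missing or malformed file falls back to that default. None of this is confirmed by the source, and `resolve_max_pdf_bytes` is a hypothetical name.

```python
import json
import os
from pathlib import Path

CONFIG_PATH = Path("/config/doc_service_config.json")
DEFAULT_MAX_PDF_BYTES = 20 * 1024 * 1024  # equals the documented 20971520


def resolve_max_pdf_bytes(config_path: Path = CONFIG_PATH, env=os.environ) -> int:
    """Return the effective PDF size limit in bytes."""
    # Assumption: DOC_MAX_PDF_MB, when set and non-empty, wins over the JSON file.
    override = env.get("DOC_MAX_PDF_MB")
    if override:
        return int(float(override) * 1024 * 1024)
    try:
        data = json.loads(config_path.read_text())
        return int(data["documents"]["max_pdf_bytes"])
    except (OSError, KeyError, ValueError, TypeError):
        # Assumption: fall back to the default if the file is missing or malformed.
        return DEFAULT_MAX_PDF_BYTES
```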