Add priority queue to ai-service and STATUS.md workflow

- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:58:10 +02:00
parent d2495190a9
commit c4f0c7ad49
18 changed files with 1253 additions and 35 deletions
@@ -0,0 +1,112 @@
+# AI Service — Status
+
+## What it is
+
+Shared AI intermediary container. All feature containers (doc-service, future services) POST prompts here. It routes requests to the configured model (LM Studio / Ollama / Anthropic) and returns a normalised response. It is **stateless** — no database, no conversation history. History and context are the caller's responsibility.
+
+Port: `8010` (internal only, not exposed to host).
+
+---
+
+## Current functionality
+
+### Endpoints
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/chat` | Synchronous chat: submits at NORMAL priority, blocks until done |
+| `GET` | `/health` | `{"status": "ok"}` |
+| `GET` | `/health/provider` | Active provider name, model, configured flag |
+| `POST` | `/queue/jobs` | Async enqueue — returns `job_id` immediately |
+| `GET` | `/queue/jobs/{id}` | Poll job: status, position, result, error |
+| `DELETE` | `/queue/jobs/{id}` | Cancel a pending job |
+| `GET` | `/queue/status` | Worker state: running, paused, queue_size, current_job_id |
+| `POST` | `/queue/pause` | Finish current job, stop picking new ones |
+| `POST` | `/queue/resume` | Unpause |
+| `POST` | `/queue/start` | Start (or restart) the worker task |
+| `POST` | `/queue/stop` | Stop worker (pending jobs stay queued) |
+
+### Priority queue
+
+- Three levels: `high` (1) > `normal` (3) > `low` (5); a lower number is served first
+- FIFO within same priority level (monotonic sequence counter)
+- Single async worker — one LLM call at a time
+- Pause / resume / start / stop without restarting the container
+- `POST /chat` is a synchronous wrapper: enqueues at NORMAL, awaits the future
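
The queue behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code added by this commit: `Priority`, `PriorityJobQueue`, `submit`, `run_worker`, `fake_llm` and `demo` are hypothetical names, and the real worker's pause/resume/stop handling is left out. It assumes jobs are stored as `(priority, sequence, payload, future)` tuples so that the sequence counter breaks ties in FIFO order.

```python
import asyncio
import itertools
from enum import IntEnum


class Priority(IntEnum):
    # Values taken from the status file; lower value is served first.
    HIGH = 1
    NORMAL = 3
    LOW = 5


class PriorityJobQueue:
    """Hypothetical single-worker queue: priority order, FIFO within a level."""

    def __init__(self, handler):
        self._handler = handler            # async callable doing the real work
        self._queue = asyncio.PriorityQueue()
        self._seq = itertools.count()      # monotonic tie-breaker

    def submit(self, payload, priority=Priority.NORMAL):
        """Enqueue a job and return a future that resolves with its result."""
        future = asyncio.get_running_loop().create_future()
        # Tuples compare element by element, so the unique sequence number
        # settles ties and payload/future are never compared.
        self._queue.put_nowait((int(priority), next(self._seq), payload, future))
        return future

    async def run_worker(self):
        """Process one job at a time until the task is cancelled."""
        while True:
            _, _, payload, future = await self._queue.get()
            try:
                result = await self._handler(payload)
                if not future.cancelled():
                    future.set_result(result)
            except Exception as exc:
                if not future.cancelled():
                    future.set_exception(exc)
            finally:
                self._queue.task_done()


async def demo():
    order = []

    async def fake_llm(prompt):
        # Stand-in for the provider call; records the processing order.
        order.append(prompt)
        return prompt.upper()

    jobs = PriorityJobQueue(fake_llm)
    futures = [
        jobs.submit("low-1", Priority.LOW),
        jobs.submit("normal-1"),
        jobs.submit("high-1", Priority.HIGH),
        jobs.submit("normal-2"),
    ]
    worker = asyncio.create_task(jobs.run_worker())
    # A synchronous wrapper in the style of POST /chat would simply await
    # the future returned by submit().
    results = await asyncio.gather(*futures)
    worker.cancel()
    return order, results
```

Running `asyncio.run(demo())` returns the processing order and the per-job results. Because all four jobs are enqueued before the worker starts, the high-priority job is handled first and the two normal jobs keep their submission order.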
+
+### Providers
+
+| Provider | Protocol | SDK |
+|----------|----------|-----|
+| LM Studio | OpenAI-compatible HTTP | openai |
+| Ollama | OpenAI-compatible HTTP | openai |
+| Anthropic | Anthropic API (HTTPS) | anthropic |
+
+The active provider is selected by the `"provider"` key in `/config/ai_service_config.json` (shared Docker volume); env vars override it for dev.
+
+### Configuration (env var overrides)
+
+```
+AI_PROVIDER          lmstudio | ollama | anthropic
+LMSTUDIO_BASE_URL    http://host.docker.internal:1234/v1
+LMSTUDIO_API_KEY     sk-lm-…
+LMSTUDIO_MODEL       gemma-4-e4b-it          ← current
+OLLAMA_BASE_URL / OLLAMA_MODEL / OLLAMA_API_KEY
+ANTHROPIC_API_KEY / ANTHROPIC_MODEL
+```
+
+Credentials live in `features/ai-service/.env` (gitignored).
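
One way the provider lookup described above could work is sketched below. It assumes the environment variable wins over the JSON file and that names are matched case-insensitively; `resolve_provider` and `KNOWN_PROVIDERS` are hypothetical names, and the service's actual Pydantic settings class may resolve values differently.

```python
import json
import os
from pathlib import Path

# Provider names as listed in the configuration block.
KNOWN_PROVIDERS = {"lmstudio", "ollama", "anthropic"}


def resolve_provider(config_path, env=None):
    """Return the active provider: AI_PROVIDER env var first, then the JSON file."""
    env = os.environ if env is None else env
    provider = env.get("AI_PROVIDER")
    if not provider:
        path = Path(config_path)
        if path.is_file():
            config = json.loads(path.read_text(encoding="utf-8"))
            provider = config.get("provider")
    if not provider:
        # Assumption: a missing provider is an error rather than a default.
        raise LookupError("no provider configured")
    provider = provider.strip().lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return provider
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.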
+
+### Error codes
+
+| Code | Meaning |
+|------|---------|
+| 422 | Invalid request (empty messages, unknown priority) |
+| 502 | Provider connection / API error |
+| 503 | Provider not configured / unknown provider |
+| 504 | Provider timeout |
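
The table above could be backed by a small lookup from failure type to status code. The sketch below is illustrative only: the exception classes and `status_for` are hypothetical and do not come from this commit, and the fallback to 500 for unmapped errors is an assumption.

```python
class InvalidRequest(Exception):
    """Hypothetical: empty messages or an unknown priority."""


class ProviderNotConfigured(Exception):
    """Hypothetical: provider missing from config or not recognised."""


class ProviderTimeout(Exception):
    """Hypothetical: the provider did not answer in time."""


class ProviderError(Exception):
    """Hypothetical: connection failure or API error from the provider."""


# Ordered (exception type, HTTP status) pairs; the first match wins.
STATUS_BY_ERROR = (
    (InvalidRequest, 422),
    (ProviderNotConfigured, 503),
    (ProviderTimeout, 504),
    (ProviderError, 502),
)


def status_for(exc):
    """Map an exception instance to the HTTP status from the table."""
    for exc_type, status in STATUS_BY_ERROR:
        if isinstance(exc, exc_type):
            return status
    return 500  # assumption: anything unmapped is an internal error
```

Keeping the mapping in one table means the HTTP handler and the queue worker can report the same code for the same failure.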
+
+---
+
+## Architecture
+
+```
+Callers (doc-service, future services)
+    │
+    └─▶ POST /chat (sync)         ─┐
+    └─▶ POST /queue/jobs (async)  ─┤
+                                   ▼
+                        asyncio.PriorityQueue
+                        (HIGH=1, NORMAL=3, LOW=5)
+                                   │
+                        QueueWorker (single task)
+                                   │
+                        execute_chat(request)
+                                   │
+                        Provider SDK (openai / anthropic)
+                                   │
+                        LM Studio / Ollama / Anthropic API
+```
+
+---
+
+## Known limitations / not implemented
+
+- **TLS to LM Studio** — communication is plain HTTP (`http://host.docker.internal:1234`). Deferred until LM Studio HTTPS configuration is confirmed. When ready: set `LMSTUDIO_BASE_URL=https://...` and optionally add `ssl_verify` + `ca_bundle` config keys to the OpenAI-compat provider.
+- **True preemption** — a HIGH job arriving while a LOW job is processing will be next in queue but will not interrupt the running inference.
+- **Queue persistence** — the in-memory queue is lost on container restart. Pending jobs are not persisted to disk.
+- **Authentication on queue endpoints** — `/queue/*` management endpoints have no auth guard. Should be protected before any public/multi-tenant deployment (internal network is the only current protection).
+- **Streaming responses** — `/chat` returns the full response after generation. Streaming (Server-Sent Events) not implemented.
+- **Metrics / observability** — no Prometheus metrics, no structured request logging per job.
+
+---
+
+## Future work
+
+- [ ] TLS support for LM Studio / Ollama (`ssl_verify`, `ca_bundle` config)
+- [ ] Auth guard on queue management endpoints (admin token or internal-only route)
+- [ ] Streaming responses via SSE (`POST /chat/stream`)
+- [ ] Queue persistence (SQLite or Redis-backed) so jobs survive restarts
+- [ ] Job result TTL / cleanup (currently jobs accumulate in `_jobs` dict indefinitely)
+- [ ] Per-caller priority override (e.g. doc-service background jobs = LOW, user-triggered = NORMAL)
+- [ ] Metrics endpoint (`/metrics`) for queue depth, job latency, provider error rate