c4f0c7ad49
- Introduce async priority queue service in ai-service; all /chat calls now route through it - Refactor chat router to separate execute_chat (core logic) from the HTTP handler - Add /queue endpoints (status, pause, resume, cancel) for queue management - Update ai-service config to use Pydantic v2 model_config style - Add STATUS.md files for backend, ai-service, doc-service, and frontend - Document STATUS.md workflow in CLAUDE.md - Update doc-service documents router and schemas; frontend DocumentsPage and API client Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4.8 KiB
4.8 KiB
AI Service — Status
What it is
Shared AI intermediary container. All feature containers (doc-service, future services) POST prompts here. It routes requests to the configured model (LM Studio / Ollama / Anthropic) and returns a normalised response. It is stateless — no database, no conversation history. History and context are the caller's responsibility.
Port: 8010 (internal only, not exposed to host).
Current functionality
Endpoints
| Method | Path | Description |
|---|---|---|
POST |
/chat |
Synchronous chat: submits at NORMAL priority, blocks until done |
GET |
/health |
{"status": "ok"} |
GET |
/health/provider |
Active provider name, model, configured flag |
POST |
/queue/jobs |
Async enqueue — returns job_id immediately |
GET |
/queue/jobs/{id} |
Poll job: status, position, result, error |
DELETE |
/queue/jobs/{id} |
Cancel a pending job |
GET |
/queue/status |
Worker state: running, paused, queue_size, current_job_id |
POST |
/queue/pause |
Finish current job, stop picking new ones |
POST |
/queue/resume |
Unpause |
POST |
/queue/start |
Start (or restart) the worker task |
POST |
/queue/stop |
Stop worker (pending jobs stay queued) |
Priority queue
- Three levels:
high(1) >normal(3) >low(5) - FIFO within same priority level (monotonic sequence counter)
- Single async worker — one LLM call at a time
- Pause / resume / start / stop without restarting the container
POST /chatis a synchronous wrapper: enqueues at NORMAL, awaits the future
Providers
| Provider | Protocol | SDK |
|---|---|---|
| LM Studio | OpenAI-compatible HTTP | openai |
| Ollama | OpenAI-compatible HTTP | openai |
| Anthropic | Anthropic API (HTTPS) | anthropic |
Active provider is selected by "provider" key in /config/ai_service_config.json (shared Docker volume), with env var overrides for dev.
Configuration (env var overrides)
AI_PROVIDER lmstudio | ollama | anthropic
LMSTUDIO_BASE_URL http://host.docker.internal:1234/v1
LMSTUDIO_API_KEY sk-lm-…
LMSTUDIO_MODEL gemma-4-e4b-it ← current
OLLAMA_BASE_URL / OLLAMA_MODEL / OLLAMA_API_KEY
ANTHROPIC_API_KEY / ANTHROPIC_MODEL
Credentials live in features/ai-service/.env (gitignored).
Error codes
| Code | Meaning |
|---|---|
| 422 | Bad request (empty messages, unknown priority) |
| 502 | Provider connection / API error |
| 503 | Provider not configured / unknown provider |
| 504 | Provider timeout |
Architecture
Callers (doc-service, future services)
│
└─▶ POST /chat (sync) ─┐
└─▶ POST /queue/jobs (async) ─┤
▼
asyncio.PriorityQueue
(HIGH=1, NORMAL=3, LOW=5)
│
QueueWorker (single task)
│
execute_chat(request)
│
Provider SDK (openai / anthropic)
│
LM Studio / Ollama / Anthropic API
Known limitations / not implemented
- TLS to LM Studio — communication is plain HTTP (
http://host.docker.internal:1234). Deferred until LM Studio HTTPS configuration is confirmed. When ready: setLMSTUDIO_BASE_URL=https://...and optionally addssl_verify+ca_bundleconfig keys to the OpenAI-compat provider. - True preemption — a HIGH job arriving while a LOW job is processing will be next in queue but will not interrupt the running inference.
- Queue persistence — the in-memory queue is lost on container restart. Pending jobs are not persisted to disk.
- Authentication on queue endpoints —
/queue/*management endpoints have no auth guard. Should be protected before any public/multi-tenant deployment (internal network is the only current protection). - Streaming responses —
/chatreturns the full response after generation. Streaming (Server-Sent Events) not implemented. - Metrics / observability — no Prometheus metrics, no structured request logging per job.
Future work
- TLS support for LM Studio / Ollama (
ssl_verify,ca_bundleconfig) - Auth guard on queue management endpoints (admin token or internal-only route)
- Streaming responses via SSE (
POST /chat/stream) - Queue persistence (SQLite or Redis-backed) so jobs survive restarts
- Job result TTL / cleanup (currently jobs accumulate in
_jobsdict indefinitely) - Per-caller priority override (e.g. doc-service background jobs = LOW, user-triggered = NORMAL)
- Metrics endpoint (
/metrics) for queue depth, job latency, provider error rate