# AI Service — Status

## What it is

Shared AI intermediary container. All feature containers (doc-service, future services) POST prompts here. It routes requests to the configured model (LM Studio / Ollama / Anthropic) and returns a normalised response.

It is **stateless** — no database, no conversation history. History and context are the caller's responsibility.

Port: `8010` (internal only, not exposed to host).

---

## Current functionality

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/chat` | Synchronous chat: submits at NORMAL priority, blocks until done |
| `GET` | `/health` | `{"status": "ok"}` |
| `GET` | `/health/provider` | Active provider name, model, configured flag |
| `POST` | `/queue/jobs` | Async enqueue — returns `job_id` immediately |
| `GET` | `/queue/jobs/{id}` | Poll job: status, position, result, error |
| `DELETE` | `/queue/jobs/{id}` | Cancel a pending job |
| `GET` | `/queue/status` | Worker state: running, paused, queue_size, current_job_id |
| `POST` | `/queue/pause` | Finish current job, stop picking new ones |
| `POST` | `/queue/resume` | Unpause |
| `POST` | `/queue/start` | Start (or restart) the worker task |
| `POST` | `/queue/stop` | Stop worker (pending jobs stay queued) |

### Priority queue

- Three levels: `high` (1) > `normal` (3) > `low` (5)
- FIFO within same priority level (monotonic sequence counter)
- Single async worker — one LLM call at a time
- Pause / resume / start / stop without restarting the container
- `POST /chat` is a synchronous wrapper: enqueues at NORMAL, awaits the future

### Providers

| Provider | Protocol | SDK |
|----------|----------|-----|
| LM Studio | OpenAI-compatible HTTP | openai |
| Ollama | OpenAI-compatible HTTP | openai |
| Anthropic | Anthropic API (HTTPS) | anthropic |

Active provider is selected by `"provider"` key in `/config/ai_service_config.json` (shared Docker volume), with env var overrides for dev.

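
The provider selection rule can be sketched roughly as below. This is an illustration only, not the service's actual code: `resolve_provider` is a hypothetical helper, and the sketch assumes that the `AI_PROVIDER` environment variable, when set, takes precedence over the `"provider"` key in the config file.

```python
import json
import os
from pathlib import Path

# Provider names listed in this document.
KNOWN_PROVIDERS = {"lmstudio", "ollama", "anthropic"}


def resolve_provider(config_path, env=None):
    """Return the active provider name (hypothetical helper).

    Assumption: AI_PROVIDER, when set, overrides the "provider" key
    in the JSON config file.
    """
    env = os.environ if env is None else env

    # Read the "provider" key from the config file, if the file exists.
    file_value = None
    path = Path(config_path)
    if path.exists():
        file_value = json.loads(path.read_text()).get("provider")

    # Environment variable wins over the file value.
    name = env.get("AI_PROVIDER") or file_value
    if name is None:
        raise ValueError("provider not configured")

    name = name.lower()
    if name not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {name}")
    return name
```

In the real service, the two failure cases in this sketch would presumably surface to callers as the 503 response ("Provider not configured / unknown provider") rather than as a Python exception.
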
### Configuration (env var overrides)

```
AI_PROVIDER          lmstudio | ollama | anthropic
LMSTUDIO_BASE_URL    http://host.docker.internal:1234/v1
LMSTUDIO_API_KEY     sk-lm-…
LMSTUDIO_MODEL       gemma-4-e4b-it   ← current
OLLAMA_BASE_URL / OLLAMA_MODEL / OLLAMA_API_KEY
ANTHROPIC_API_KEY / ANTHROPIC_MODEL
```

Credentials live in `features/ai-service/.env` (gitignored).

### Error codes

| Code | Meaning |
|------|---------|
| 422 | Bad request (empty messages, unknown priority) |
| 502 | Provider connection / API error |
| 503 | Provider not configured / unknown provider |
| 504 | Provider timeout |

---

## Architecture

```
Callers (doc-service, future services)
    │
    └─▶ POST /chat          (sync)  ─┐
    └─▶ POST /queue/jobs    (async) ─┤
                                     ▼
                        asyncio.PriorityQueue
                        (HIGH=1, NORMAL=3, LOW=5)
                                     │
                        QueueWorker (single task)
                                     │
                        execute_chat(request)
                                     │
                        Provider SDK (openai / anthropic)
                                     │
                        LM Studio / Ollama / Anthropic API
```

---

## Known limitations / not implemented

- **TLS to LM Studio** — communication is plain HTTP (`http://host.docker.internal:1234`). Deferred until LM Studio HTTPS configuration is confirmed. When ready: set `LMSTUDIO_BASE_URL=https://...` and optionally add `ssl_verify` + `ca_bundle` config keys to the OpenAI-compat provider.
- **True preemption** — a HIGH job arriving while a LOW job is processing will be next in queue but will not interrupt the running inference.
- **Queue persistence** — the in-memory queue is lost on container restart. Pending jobs are not persisted to disk.
- **Authentication on queue endpoints** — `/queue/*` management endpoints have no auth guard. Should be protected before any public/multi-tenant deployment (internal network is the only current protection).
- **Streaming responses** — `/chat` returns the full response after generation. Streaming (Server-Sent Events) not implemented.
- **Metrics / observability** — no Prometheus metrics, no structured request logging per job.

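
The "true preemption" limitation, together with the FIFO-within-priority rule, can be illustrated with a small simulation. This is a sketch under assumptions, not the service's `QueueWorker`: the `demo`, `submit` and `worker` names are hypothetical, and a short `asyncio.sleep` stands in for one LLM call.

```python
import asyncio
import itertools

HIGH, NORMAL, LOW = 1, 3, 5  # priority values from this document


async def demo():
    queue = asyncio.PriorityQueue()
    seq = itertools.count()        # monotonic counter: FIFO tie-break
    order = []                     # completion order, for inspection
    low_started = asyncio.Event()

    async def submit(priority, name):
        # Lower number sorts first; seq breaks ties in arrival order.
        await queue.put((priority, next(seq), name))

    async def worker():
        # Single worker: one job at a time, never interrupted.
        while True:
            _, _, name = await queue.get()
            if name == "low-1":
                low_started.set()
            await asyncio.sleep(0.01)  # stand-in for one LLM call
            order.append(name)
            queue.task_done()

    await submit(LOW, "low-1")
    task = asyncio.create_task(worker())
    await low_started.wait()       # low-1 is now "running"

    # These arrive while low-1 is still being processed.
    await submit(NORMAL, "normal-1")
    await submit(NORMAL, "normal-2")
    await submit(HIGH, "high-1")

    await queue.join()             # wait until every job is done
    task.cancel()
    return order
```

Running `asyncio.run(demo())` returns `["low-1", "high-1", "normal-1", "normal-2"]`: the HIGH job jumps ahead of the queued NORMAL jobs, but only after the LOW job that was already running has finished.
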
---

## Future work

- [ ] TLS support for LM Studio / Ollama (`ssl_verify`, `ca_bundle` config)
- [ ] Auth guard on queue management endpoints (admin token or internal-only route)
- [ ] Streaming responses via SSE (`POST /chat/stream`)
- [ ] Queue persistence (SQLite or Redis-backed) so jobs survive restarts
- [ ] Job result TTL / cleanup (currently jobs accumulate in `_jobs` dict indefinitely)
- [ ] Per-caller priority override (e.g. doc-service background jobs = LOW, user-triggered = NORMAL)
- [ ] Metrics endpoint (`/metrics`) for queue depth, job latency, provider error rate