curo1305 c4f0c7ad49 Add priority queue to ai-service and STATUS.md workflow

- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-14 22:58:10 +02:00


AI Service — Status

What it is

Shared AI intermediary container. All feature containers (doc-service, future services) POST prompts here. It routes each request to the configured provider (LM Studio / Ollama / Anthropic) and returns a normalised response. It is stateless: no database, no conversation history. History and context are the caller's responsibility.

Port: 8010 (internal only, not exposed to host).

Current functionality

Endpoints

Method	Path	Description
`POST`	`/chat`	Synchronous chat: submits at NORMAL priority, blocks until done
`GET`	`/health`	`{"status": "ok"}`
`GET`	`/health/provider`	Active provider name, model, configured flag
`POST`	`/queue/jobs`	Async enqueue — returns `job_id` immediately
`GET`	`/queue/jobs/{id}`	Poll job: status, position, result, error
`DELETE`	`/queue/jobs/{id}`	Cancel a pending job
`GET`	`/queue/status`	Worker state: running, paused, queue_size, current_job_id
`POST`	`/queue/pause`	Finish current job, stop picking new ones
`POST`	`/queue/resume`	Unpause
`POST`	`/queue/start`	Start (or restart) the worker task
`POST`	`/queue/stop`	Stop worker (pending jobs stay queued)
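
The async flow above (enqueue via `POST /queue/jobs`, then poll `GET /queue/jobs/{id}`) could be driven by a small polling helper like the one below. This is a minimal sketch, not the service's client: the terminal status strings (`done`, `failed`, `cancelled`) and the `fetch` callable are assumptions, since the exact response schema is not shown here.

```python
import time
from typing import Callable

# Assumed terminal status names; the real service may use different strings.
TERMINAL_STATUSES = {"done", "failed", "cancelled"}


def wait_for_job(
    fetch: Callable[[str], dict],
    job_id: str,
    interval: float = 0.5,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a job until it reaches a terminal status or the timeout expires.

    `fetch` is a hypothetical callable that performs GET /queue/jobs/{id}
    and returns the decoded JSON body as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r}")
        sleep(interval)
```

In practice `fetch` would wrap an HTTP GET against the service on port 8010. Injecting it keeps the helper independent of any particular HTTP library and easy to test.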

Priority queue

Three levels: high (1) > normal (3) > low (5); the lowest number is served first
FIFO within same priority level (monotonic sequence counter)
Single async worker — one LLM call at a time
Pause / resume / start / stop without restarting the container
POST /chat is a synchronous wrapper: enqueues at NORMAL, awaits the future
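
The ordering rules above can be illustrated with `asyncio.PriorityQueue` and a monotonic counter as tie-breaker. This is a sketch under stated assumptions, not the service's actual code: `JobQueue`, `submit`, `run_worker`, `chat_sync` and `demo` are hypothetical names, and the handler stands in for the real LLM call.

```python
import asyncio
import itertools
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 1
    NORMAL = 3
    LOW = 5


class JobQueue:
    """Hypothetical queue: lowest priority number first, FIFO within a level."""

    def __init__(self) -> None:
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._seq = itertools.count()  # monotonic sequence counter

    def submit(self, priority: Priority, payload: str) -> asyncio.Future:
        future = asyncio.get_running_loop().create_future()
        # (priority, seq) is unique, so payload and future are never compared.
        self._queue.put_nowait((int(priority), next(self._seq), payload, future))
        return future

    async def run_worker(self, handler) -> None:
        # Single worker: jobs are processed strictly one at a time.
        while True:
            _prio, _seq, payload, future = await self._queue.get()
            try:
                if not future.cancelled():
                    future.set_result(await handler(payload))
            except Exception as exc:
                if not future.cancelled():
                    future.set_exception(exc)
            finally:
                self._queue.task_done()


async def chat_sync(queue: JobQueue, payload: str) -> str:
    # Synchronous wrapper: enqueue at NORMAL, then await the job's future.
    return await queue.submit(Priority.NORMAL, payload)


async def demo() -> list[str]:
    processed: list[str] = []

    async def handler(payload: str) -> str:
        processed.append(payload)
        return payload.upper()

    queue = JobQueue()
    futures = [
        queue.submit(Priority.LOW, "low"),
        queue.submit(Priority.NORMAL, "normal-1"),
        queue.submit(Priority.HIGH, "high"),
        queue.submit(Priority.NORMAL, "normal-2"),
    ]
    worker = asyncio.create_task(queue.run_worker(handler))
    await asyncio.gather(*futures)
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return processed
```

Running `asyncio.run(demo())` processes the jobs in the order `high`, `normal-1`, `normal-2`, `low`, even though `low` was submitted first.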

Providers

Provider	Protocol	SDK
LM Studio	OpenAI-compatible HTTP	openai
Ollama	OpenAI-compatible HTTP	openai
Anthropic	Anthropic API (HTTPS)	anthropic

The active provider is selected by the "provider" key in /config/ai_service_config.json (shared Docker volume); env vars override it for dev.
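
One way to picture the selection logic: read the "provider" key from the JSON file, then let the `AI_PROVIDER` env var take precedence if it is set. This is a minimal sketch with assumed details; `resolve_provider` is a hypothetical name, and the real service (which uses Pydantic settings) may resolve precedence differently.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

KNOWN_PROVIDERS = {"lmstudio", "ollama", "anthropic"}


def resolve_provider(config_path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the active provider name; the env var wins over the config file."""
    env = os.environ if env is None else env
    file_value = None
    path = Path(config_path)
    if path.is_file():
        file_value = json.loads(path.read_text(encoding="utf-8")).get("provider")
    name = env.get("AI_PROVIDER") or file_value
    if name not in KNOWN_PROVIDERS:
        # The service reports this case as HTTP 503 (see the error codes).
        raise ValueError(f"unknown or missing provider: {name!r}")
    return name
```

For example, `resolve_provider("/config/ai_service_config.json")` would return the file's value unless `AI_PROVIDER` is exported in the container environment.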

Configuration (env var overrides)

AI_PROVIDER          lmstudio | ollama | anthropic
LMSTUDIO_BASE_URL    http://host.docker.internal:1234/v1
LMSTUDIO_API_KEY     sk-lm-…
LMSTUDIO_MODEL       gemma-4-e4b-it          ← current
OLLAMA_BASE_URL / OLLAMA_MODEL / OLLAMA_API_KEY
ANTHROPIC_API_KEY / ANTHROPIC_MODEL

Credentials live in features/ai-service/.env (gitignored).

Error codes

Code	Meaning
422	Invalid request (empty messages, unknown priority)
502	Provider connection / API error
503	Provider not configured / unknown provider
504	Provider timeout
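
The table above amounts to a mapping from failure type to HTTP status. A sketch of that mapping follows; the exception class names are hypothetical stand-ins (the service's real exception types are not shown here), and the fallback to 500 for anything unrecognised is an assumption.

```python
class ProviderNotConfigured(Exception):
    """Hypothetical: no usable provider settings, or unknown provider name."""


class ProviderTimeout(Exception):
    """Hypothetical: the provider did not answer in time."""


class ProviderError(Exception):
    """Hypothetical: connection failure or API error from the provider."""


# Checked in order; the first matching type decides the status code.
STATUS_BY_ERROR = [
    (ValueError, 422),             # empty messages, unknown priority
    (ProviderNotConfigured, 503),
    (ProviderTimeout, 504),
    (ProviderError, 502),
]


def status_for(exc: BaseException) -> int:
    """Map an exception to the HTTP status code listed in the table."""
    for exc_type, status in STATUS_BY_ERROR:
        if isinstance(exc, exc_type):
            return status
    return 500  # assumption: anything else is an internal error
```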

Architecture

Callers (doc-service, future services)
    │
    └─▶ POST /chat (sync)         ─┐
    └─▶ POST /queue/jobs (async)  ─┤
                                   ▼
                        asyncio.PriorityQueue
                        (HIGH=1, NORMAL=3, LOW=5)
                                   │
                        QueueWorker (single task)
                                   │
                        execute_chat(request)
                                   │
                        Provider SDK (openai / anthropic)
                                   │
                        LM Studio / Ollama / Anthropic API

Known limitations / not implemented

TLS to LM Studio — communication is plain HTTP (http://host.docker.internal:1234). Deferred until LM Studio HTTPS configuration is confirmed. When ready: set LMSTUDIO_BASE_URL=https://... and optionally add ssl_verify + ca_bundle config keys to the OpenAI-compat provider.
True preemption — a HIGH job arriving while a LOW job is processing will be next in queue but will not interrupt the running inference.
Queue persistence — the in-memory queue is lost on container restart. Pending jobs are not persisted to disk.
Authentication on queue endpoints — /queue/* management endpoints have no auth guard. Should be protected before any public/multi-tenant deployment (internal network is the only current protection).
Streaming responses — /chat returns the full response after generation. Streaming (Server-Sent Events) not implemented.
Metrics / observability — no Prometheus metrics, no structured request logging per job.

Future work

TLS support for LM Studio / Ollama (ssl_verify, ca_bundle config)
Auth guard on queue management endpoints (admin token or internal-only route)
Streaming responses via SSE (POST /chat/stream)
Queue persistence (SQLite or Redis-backed) so jobs survive restarts
Job result TTL / cleanup (currently jobs accumulate in _jobs dict indefinitely)
Per-caller priority override (e.g. doc-service background jobs = LOW, user-triggered = NORMAL)
Metrics endpoint (/metrics) for queue depth, job latency, provider error rate
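
As an illustration of the job result TTL item above, cleanup could be a periodic sweep over the jobs dict. This is only a sketch: the `finished_at` field (a `time.monotonic()` timestamp set when a job completes) and the function name `purge_finished` are assumptions, not part of the current code.

```python
import time
from typing import Optional


def purge_finished(jobs: dict, ttl_seconds: float, now: Optional[float] = None) -> int:
    """Delete jobs that finished more than `ttl_seconds` ago; return the count.

    Jobs without a `finished_at` value (pending or running) are never removed.
    """
    now = time.monotonic() if now is None else now
    expired = [
        job_id
        for job_id, job in jobs.items()
        if job.get("finished_at") is not None
        and now - job["finished_at"] > ttl_seconds
    ]
    for job_id in expired:
        del jobs[job_id]
    return len(expired)
```

A background task could call this every few minutes, keeping recent results available for polling while bounding memory use.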
