Business-Management/features/ai-service/STATUS.md
curo1305 c4f0c7ad49 Add priority queue to ai-service and STATUS.md workflow
- Introduce async priority queue service in ai-service; all /chat calls now route through it
- Refactor chat router to separate execute_chat (core logic) from the HTTP handler
- Add /queue endpoints (status, pause, resume, cancel) for queue management
- Update ai-service config to use Pydantic v2 model_config style
- Add STATUS.md files for backend, ai-service, doc-service, and frontend
- Document STATUS.md workflow in CLAUDE.md
- Update doc-service documents router and schemas; frontend DocumentsPage and API client

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 22:58:10 +02:00

AI Service — Status

What it is

Shared AI intermediary container. All feature containers (doc-service, future services) POST prompts here. It routes requests to the configured model (LM Studio / Ollama / Anthropic) and returns a normalised response. It is stateless — no database, no conversation history. History and context are the caller's responsibility.

Port: 8010 (internal only, not exposed to host).


Current functionality

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /chat | Synchronous chat: submits at NORMAL priority, blocks until done |
| GET | /health | {"status": "ok"} |
| GET | /health/provider | Active provider name, model, configured flag |
| POST | /queue/jobs | Async enqueue: returns job_id immediately |
| GET | /queue/jobs/{id} | Poll job: status, position, result, error |
| DELETE | /queue/jobs/{id} | Cancel a pending job |
| GET | /queue/status | Worker state: running, paused, queue_size, current_job_id |
| POST | /queue/pause | Finish current job, stop picking new ones |
| POST | /queue/resume | Unpause |
| POST | /queue/start | Start (or restart) the worker task |
| POST | /queue/stop | Stop worker (pending jobs stay queued) |
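
A caller on the async path would enqueue a job with POST /queue/jobs and then poll GET /queue/jobs/{id}. The following is a minimal sketch of that polling loop, not the service's client code. It assumes the job payload carries a `status` field whose terminal values are `done`, `failed` and `cancelled` (the table lists the fields but not their values), and `fetch_job` is a hypothetical callable that performs the GET and returns the decoded JSON.

```python
import time
from typing import Callable

# Assumed terminal status names; the real service may use different ones.
TERMINAL_STATUSES = {"done", "failed", "cancelled"}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval: float = 0.5,
    timeout: float = 60.0,
) -> dict:
    """Poll a job until it reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)  # hypothetical wrapper around GET /queue/jobs/{id}
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        time.sleep(interval)
```

Injecting `fetch_job` keeps the loop independent of any particular HTTP library, so the caller can plug in whatever client it already uses.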

Priority queue

  • Three levels, lowest number served first: high (1), normal (3), low (5)
  • FIFO within same priority level (monotonic sequence counter)
  • Single async worker — one LLM call at a time
  • Pause / resume / start / stop without restarting the container
  • POST /chat is a synchronous wrapper: enqueues at NORMAL, awaits the future
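
The ordering rules above can be sketched with `asyncio.PriorityQueue` and a monotonic counter as tie-breaker. This is an illustrative sketch under assumptions, not the service's implementation: `JobQueue`, `submit`, `worker` and `demo` are hypothetical names, and only the priority values come from this document.

```python
import asyncio
import itertools
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 1
    NORMAL = 3
    LOW = 5


class JobQueue:
    """Hypothetical queue: priority first, arrival order second."""

    def __init__(self) -> None:
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._seq = itertools.count()  # monotonic counter gives FIFO per level

    def submit(self, priority: Priority, payload: str) -> asyncio.Future:
        future = asyncio.get_running_loop().create_future()
        # Tuples compare left to right; the unique sequence number means the
        # payload and future are never compared.
        self._queue.put_nowait((int(priority), next(self._seq), payload, future))
        return future

    async def worker(self, handler) -> None:
        # Single worker: one job is handled at a time.
        while True:
            _, _, payload, future = await self._queue.get()
            try:
                if not future.cancelled():
                    future.set_result(await handler(payload))
            except Exception as exc:
                if not future.cancelled():
                    future.set_exception(exc)
            finally:
                self._queue.task_done()


async def demo() -> list:
    order = []

    async def handler(payload: str) -> str:
        order.append(payload)
        return payload.upper()

    queue = JobQueue()
    futures = [
        queue.submit(Priority.LOW, "low-1"),
        queue.submit(Priority.NORMAL, "normal-1"),
        queue.submit(Priority.HIGH, "high-1"),
        queue.submit(Priority.NORMAL, "normal-2"),
    ]
    task = asyncio.create_task(queue.worker(handler))
    await asyncio.gather(*futures)
    task.cancel()
    return order
```

In this sketch a synchronous wrapper such as POST /chat would amount to `await queue.submit(Priority.NORMAL, payload)`: the handler enqueues and then awaits the future that the worker resolves.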

Providers

| Provider | Protocol | SDK |
| --- | --- | --- |
| LM Studio | OpenAI-compatible HTTP | openai |
| Ollama | OpenAI-compatible HTTP | openai |
| Anthropic | Anthropic API (HTTPS) | anthropic |

The active provider is selected by the "provider" key in /config/ai_service_config.json (shared Docker volume), with environment variable overrides for development.
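
One way to read "config file with env var overrides" is sketched below. It assumes the environment variable wins over the file when both are set, and that a missing file is tolerated; `resolve_provider` and `KNOWN_PROVIDERS` are hypothetical names, and the real service may resolve settings differently (for example through Pydantic settings).

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

KNOWN_PROVIDERS = {"lmstudio", "ollama", "anthropic"}


def resolve_provider(
    config_path: str = "/config/ai_service_config.json",
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the provider name, letting AI_PROVIDER override the file."""
    env = os.environ if env is None else env
    file_config = {}
    path = Path(config_path)
    if path.is_file():
        file_config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: a non-empty env var takes precedence over the "provider" key.
    provider = env.get("AI_PROVIDER") or file_config.get("provider")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown or missing provider: {provider!r}")
    return provider
```

Passing `env` explicitly keeps the function easy to exercise without touching the process environment.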

Configuration (env var overrides)

AI_PROVIDER          lmstudio | ollama | anthropic
LMSTUDIO_BASE_URL    http://host.docker.internal:1234/v1
LMSTUDIO_API_KEY     sk-lm-…
LMSTUDIO_MODEL       gemma-4-e4b-it          ← current
OLLAMA_BASE_URL / OLLAMA_MODEL / OLLAMA_API_KEY
ANTHROPIC_API_KEY / ANTHROPIC_MODEL

Credentials live in features/ai-service/.env (gitignored).

Error codes

| Code | Meaning |
| --- | --- |
| 422 | Bad request (empty messages, unknown priority) |
| 502 | Provider connection / API error |
| 503 | Provider not configured / unknown provider |
| 504 | Provider timeout |
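
The mapping in the table can be expressed as a small lookup from failure type to status code. The exception classes below are hypothetical stand-ins, since this document does not name the service's own exceptions, and the 500 fallback is an assumption.

```python
class BadRequest(Exception):
    """Hypothetical: empty messages or unknown priority."""


class ProviderError(Exception):
    """Hypothetical: provider connection or API failure."""


class ProviderNotConfigured(Exception):
    """Hypothetical: provider missing from config or unknown."""


class ProviderTimeout(Exception):
    """Hypothetical: provider did not answer in time."""


STATUS_BY_ERROR = {
    BadRequest: 422,
    ProviderError: 502,
    ProviderNotConfigured: 503,
    ProviderTimeout: 504,
}


def status_for(exc: Exception) -> int:
    """Translate an exception into the HTTP status listed in the table."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(exc, error_type):
            return status
    return 500  # assumption: anything unmapped is an internal error
```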

Architecture

Callers (doc-service, future services)
    │
    └─▶ POST /chat (sync)         ─┐
    └─▶ POST /queue/jobs (async)  ─┤
                                   ▼
                        asyncio.PriorityQueue
                        (HIGH=1, NORMAL=3, LOW=5)
                                   │
                        QueueWorker (single task)
                                   │
                        execute_chat(request)
                                   │
                        Provider SDK (openai / anthropic)
                                   │
                        LM Studio / Ollama / Anthropic API
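
The "finish current job, stop picking new ones" pause behaviour of the single worker can be sketched with an `asyncio.Event` used as a gate between jobs. This is a sketch under assumptions, not the actual QueueWorker: `PausableWorker` and its methods are hypothetical, priorities and error handling are left out for brevity, and in this variant one already-dequeued job is held at the gate while paused.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class PausableWorker:
    """Single worker that never interrupts a running job when paused."""

    def __init__(self, handler: Callable[[str], Awaitable[None]]) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._handler = handler
        self._resume = asyncio.Event()
        self._resume.set()  # start unpaused
        self.current_job_id: Optional[str] = None

    def enqueue(self, job_id: str, payload: str) -> None:
        self._queue.put_nowait((job_id, payload))

    def pause(self) -> None:
        self._resume.clear()

    def resume(self) -> None:
        self._resume.set()

    async def join(self) -> None:
        await self._queue.join()

    async def run(self) -> None:
        while True:
            job_id, payload = await self._queue.get()
            # The gate sits between jobs: a running handler is never
            # interrupted, but a dequeued job waits here until resume().
            await self._resume.wait()
            self.current_job_id = job_id
            try:
                await self._handler(payload)
            finally:
                self.current_job_id = None
                self._queue.task_done()


async def demo():
    done = []

    async def handler(payload: str) -> None:
        done.append(payload)

    worker = PausableWorker(handler)
    worker.pause()
    worker.enqueue("a", "first")
    worker.enqueue("b", "second")
    task = asyncio.create_task(worker.run())
    await asyncio.sleep(0.01)
    seen_while_paused = list(done)  # nothing is processed while paused
    worker.resume()
    await worker.join()
    task.cancel()
    return seen_while_paused, done
```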

Known limitations / not implemented

  • TLS to LM Studio — communication is plain HTTP (http://host.docker.internal:1234). Deferred until LM Studio HTTPS configuration is confirmed. When ready: set LMSTUDIO_BASE_URL=https://... and optionally add ssl_verify + ca_bundle config keys to the OpenAI-compat provider.
  • True preemption — a HIGH job arriving while a LOW job is processing will be next in queue but will not interrupt the running inference.
  • Queue persistence — the in-memory queue is lost on container restart. Pending jobs are not persisted to disk.
  • Authentication on queue endpoints: /queue/* management endpoints have no auth guard. Should be protected before any public/multi-tenant deployment (internal network is the only current protection).
  • Streaming responses: /chat returns the full response after generation. Streaming (Server-Sent Events) not implemented.
  • Metrics / observability — no Prometheus metrics, no structured request logging per job.

Future work

  • TLS support for LM Studio / Ollama (ssl_verify, ca_bundle config)
  • Auth guard on queue management endpoints (admin token or internal-only route)
  • Streaming responses via SSE (POST /chat/stream)
  • Queue persistence (SQLite or Redis-backed) so jobs survive restarts
  • Job result TTL / cleanup (currently jobs accumulate in _jobs dict indefinitely)
  • Per-caller priority override (e.g. doc-service background jobs = LOW, user-triggered = NORMAL)
  • Metrics endpoint (/metrics) for queue depth, job latency, provider error rate
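
For the job result TTL item above, one possible shape is a periodic sweep over the in-memory job dict. The sketch below assumes each job record is a dict with a `finished_at` monotonic timestamp that stays `None` until the job ends. That field and the `prune_finished` helper are hypothetical, since the structure of `_jobs` is not described here.

```python
import time
from typing import Dict, Optional


def prune_finished(
    jobs: Dict[str, dict],
    ttl_seconds: float,
    now: Optional[float] = None,
) -> int:
    """Delete finished jobs older than ttl_seconds; return how many were removed."""
    now = time.monotonic() if now is None else now
    expired = [
        job_id
        for job_id, job in jobs.items()
        # Jobs without a finish time are still pending or running: keep them.
        if job.get("finished_at") is not None
        and now - job["finished_at"] > ttl_seconds
    ]
    for job_id in expired:
        del jobs[job_id]
    return len(expired)
```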