# AI Service — Status
## What it is

Shared AI intermediary container. All feature containers (doc-service, future services) POST prompts here. It routes each request to the configured model provider (LM Studio / Ollama / Anthropic) and returns a normalised response. It is **stateless**: no database, no conversation history. History and context are the caller's responsibility.

Port: `8010` (internal only, not exposed to host).

---

## Current functionality

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/chat` | Synchronous chat: submits at NORMAL priority, blocks until done |
| `GET` | `/health` | `{"status": "ok"}` |
| `GET` | `/health/provider` | Active provider name, model, configured flag |
| `POST` | `/queue/jobs` | Async enqueue: returns `job_id` immediately |
| `GET` | `/queue/jobs/{id}` | Poll job: status, position, result, error |
| `DELETE` | `/queue/jobs/{id}` | Cancel a pending job |
| `GET` | `/queue/status` | Worker state: running, paused, queue_size, current_job_id |
| `POST` | `/queue/pause` | Finish current job, stop picking new ones |
| `POST` | `/queue/resume` | Unpause |
| `POST` | `/queue/start` | Start (or restart) the worker task |
| `POST` | `/queue/stop` | Stop worker (pending jobs stay queued) |

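As an illustration of the async endpoints in the table, a caller might enqueue a job and poll it roughly as follows. This is a minimal sketch, not the service's client: the hostname `ai-service`, the request body shape (`messages`, `priority`), the message format and the terminal status strings are all assumptions. Only the paths and the `job_id`, `status`, `result` and `error` fields come from the table.

```python
import json
import time
import urllib.request

BASE_URL = "http://ai-service:8010"  # assumption: container hostname on the internal network


def http_json(method, path, body=None):
    """Send a JSON request to the AI service and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


def run_job(messages, priority="normal", send=http_json, poll_seconds=1.0, max_polls=60):
    """Enqueue a job, then poll until it reaches a terminal status or the poll budget runs out."""
    job = send("POST", "/queue/jobs", {"messages": messages, "priority": priority})
    job_id = job["job_id"]
    for _ in range(max_polls):
        state = send("GET", f"/queue/jobs/{job_id}")
        # Assumption: terminal status strings; adjust to the service's real values.
        if state.get("status") in ("done", "failed", "cancelled"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `send` parameter exists only so the HTTP layer can be swapped out in tests. A caller that does not need queue position updates can simply use `POST /chat` instead.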
### Priority queue

- Three levels: `high` (1) > `normal` (3) > `low` (5); the lower number is served first
- FIFO within the same priority level (monotonic sequence counter)
- Single async worker: one LLM call at a time
- Pause / resume / start / stop without restarting the container
- `POST /chat` is a synchronous wrapper: it enqueues at NORMAL priority and awaits the job's future

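The ordering rules above can be sketched with `asyncio.PriorityQueue` and a sequence counter. This is an illustrative sketch under assumptions, not the service's code: `submit`, `worker`, `demo` and the string payloads are hypothetical names, and the handler stands in for the real LLM call.

```python
import asyncio
import itertools

PRIORITY = {"high": 1, "normal": 3, "low": 5}
_seq = itertools.count()  # monotonic counter: breaks ties in arrival order


async def submit(queue, payload, priority="normal"):
    """Enqueue a job and return a future that resolves with its result."""
    future = asyncio.get_running_loop().create_future()
    # Tuples compare element by element; seq is unique, so the comparison
    # never reaches the payload or the future.
    await queue.put((PRIORITY[priority], next(_seq), payload, future))
    return future


async def worker(queue, handler):
    """Single worker: takes one job at a time, lowest (priority, seq) first."""
    while True:
        _, _, payload, future = await queue.get()
        try:
            result = await handler(payload)
            if not future.done():
                future.set_result(result)
        except Exception as exc:
            if not future.done():
                future.set_exception(exc)
        finally:
            queue.task_done()


async def demo():
    queue = asyncio.PriorityQueue()
    order = []

    async def handler(payload):  # stand-in for the real LLM call
        order.append(payload)
        return payload.upper()

    futures = [
        await submit(queue, "low-1", "low"),
        await submit(queue, "normal-1", "normal"),
        await submit(queue, "high-1", "high"),
        await submit(queue, "normal-2", "normal"),
    ]
    task = asyncio.create_task(worker(queue, handler))
    results = await asyncio.gather(*futures)
    task.cancel()
    return order, results
```

Running `asyncio.run(demo())` processes the jobs as `high-1`, `normal-1`, `normal-2`, `low-1`. In this sketch, a synchronous wrapper in the style of `POST /chat` amounts to `await (await submit(queue, payload, "normal"))`.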
### Providers

| Provider | Protocol | SDK |
|----------|----------|-----|
| LM Studio | OpenAI-compatible HTTP | openai |
| Ollama | OpenAI-compatible HTTP | openai |
| Anthropic | Anthropic API (HTTPS) | anthropic |

The active provider is selected by the `"provider"` key in `/config/ai_service_config.json` (shared Docker volume), with env var overrides for dev.

### Configuration (env var overrides)

```
AI_PROVIDER          lmstudio | ollama | anthropic
LMSTUDIO_BASE_URL    http://host.docker.internal:1234/v1
LMSTUDIO_API_KEY     sk-lm-…
LMSTUDIO_MODEL       gemma-4-e4b-it   ← current
OLLAMA_BASE_URL / OLLAMA_MODEL / OLLAMA_API_KEY
ANTHROPIC_API_KEY / ANTHROPIC_MODEL
```

Credentials live in `features/ai-service/.env` (gitignored).

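To make the precedence concrete, provider selection (config file `"provider"` key, overridden by `AI_PROVIDER`) could look roughly like this. It is a sketch under assumptions: `resolve_provider` and `KNOWN_PROVIDERS` are hypothetical names, and treating a missing config file as "nothing configured" is a guess about the intended behaviour.

```python
import json
import os

KNOWN_PROVIDERS = {"lmstudio", "ollama", "anthropic"}


def resolve_provider(config_path="/config/ai_service_config.json", env=None):
    """Return the active provider name: env var first, then the config file."""
    env = os.environ if env is None else env
    try:
        with open(config_path, encoding="utf-8") as fh:
            config = json.load(fh)
    except FileNotFoundError:
        config = {}  # assumption: a missing file just means "nothing configured"
    provider = env.get("AI_PROVIDER") or config.get("provider")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown or missing provider: {provider!r}")
    return provider
```

In the service itself, the failure case would presumably surface as the 503 "provider not configured / unknown provider" response rather than a bare `ValueError`.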
### Error codes

| Code | Meaning |
|------|---------|
| 422 | Bad request (empty messages, unknown priority) |
| 502 | Provider connection / API error |
| 503 | Provider not configured / unknown provider |
| 504 | Provider timeout |

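One way to picture the table is as a mapping from failure type to status code. The sketch below is an assumption about how such a mapping could be written: `BadRequest`, `ProviderNotConfigured` and `status_for` are hypothetical names, and the built-in `TimeoutError` and `ConnectionError` stand in for whatever exceptions the provider SDKs actually raise.

```python
class BadRequest(Exception):
    """Hypothetical: empty messages or an unknown priority value."""


class ProviderNotConfigured(Exception):
    """Hypothetical: no provider configured, or the provider name is unknown."""


def status_for(exc):
    """Map an exception to the HTTP status code listed in the table above."""
    if isinstance(exc, BadRequest):
        return 422
    if isinstance(exc, ProviderNotConfigured):
        return 503
    if isinstance(exc, TimeoutError):
        return 504
    if isinstance(exc, ConnectionError):
        return 502
    raise exc  # anything else is not covered by the table
```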
---

## Architecture

```
Callers (doc-service, future services)
  │
  ├─▶ POST /chat        (sync)  ─┐
  └─▶ POST /queue/jobs  (async) ─┤
                                 ▼
                       asyncio.PriorityQueue
                     (HIGH=1, NORMAL=3, LOW=5)
                                 │
                     QueueWorker (single task)
                                 │
                       execute_chat(request)
                                 │
                 Provider SDK (openai / anthropic)
                                 │
                LM Studio / Ollama / Anthropic API
```

---

## Known limitations / not implemented

- **TLS to LM Studio**: communication is plain HTTP (`http://host.docker.internal:1234`). Deferred until LM Studio HTTPS configuration is confirmed. When ready: set `LMSTUDIO_BASE_URL=https://...` and optionally add `ssl_verify` + `ca_bundle` config keys to the OpenAI-compat provider.
- **True preemption**: a HIGH job arriving while a LOW job is processing will be next in the queue but will not interrupt the running inference.
- **Queue persistence**: the in-memory queue is lost on container restart. Pending jobs are not persisted to disk.
- **Authentication on queue endpoints**: `/queue/*` management endpoints have no auth guard. They should be protected before any public/multi-tenant deployment (the internal network is the only current protection).
- **Streaming responses**: `/chat` returns the full response after generation. Streaming (Server-Sent Events) is not implemented.
- **Metrics / observability**: no Prometheus metrics, no structured request logging per job.

---

## System prompts

Each feature service (doc-service, future services) owns its own system prompt, stored in that service's config JSON on the shared volume. The backend settings API (`GET/PATCH /api/settings/system-prompts`) aggregates and edits them. The AI Settings page has **General** and **System Prompts** tabs; the latter lets admins view and edit any registered service's prompts at runtime, with changes taking effect within 30 s (config cache TTL).

---

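Regarding the system prompts section: runtime edits reaching a service after a bounded delay is the behaviour a TTL-cached config read would give. The sketch below shows that idea under assumptions: `ConfigCache` and the `system_prompt` key in the usage note are hypothetical, and the 30 s default simply mirrors a config cache TTL of that length.

```python
import json
import time


class ConfigCache:
    """Re-read a JSON config file at most once per `ttl` seconds."""

    def __init__(self, path, ttl=30.0, clock=time.monotonic):
        self.path = path
        self.ttl = ttl
        self.clock = clock  # injectable so the TTL can be tested without sleeping
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self.clock()
        # Reload on first use, or once the cached copy is at least `ttl` old.
        if self._loaded_at is None or now - self._loaded_at >= self.ttl:
            with open(self.path, encoding="utf-8") as fh:
                self._value = json.load(fh)
            self._loaded_at = now
        return self._value
```

A service would then call something like `cache.get()["system_prompt"]` on each request, so an edit saved through the settings UI is picked up once the cached copy expires.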
## Future work

- [ ] TLS support for LM Studio / Ollama (`ssl_verify`, `ca_bundle` config)
- [ ] Auth guard on queue management endpoints (admin token or internal-only route)
- [ ] Streaming responses via SSE (`POST /chat/stream`)
- [ ] Queue persistence (SQLite or Redis-backed) so jobs survive restarts
- [ ] Job result TTL / cleanup (currently jobs accumulate in `_jobs` dict indefinitely)
- [ ] Per-caller priority override (e.g. doc-service background jobs = LOW, user-triggered = NORMAL)
- [ ] Metrics endpoint (`/metrics`) for queue depth, job latency, provider error rate