Platform · LLM Gateway

One egress for every prompt your enterprise sends.

Drop the AI Warden gateway in front of OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex, and your self-hosted models. Hold provider keys server-side. Enforce per-team budgets. Scan every request and response for prompt injection, secrets, and PII. Send OpenAI-compatible traffic from clients you've already shipped.

Why a gateway

If every team uses LLMs, every team needs guardrails.

A gateway turns a fleet of direct provider calls — uncountable, unbounded, untraceable — into one well-governed pipe. You set policy once, in one place. Every team gets the same guarantees. Every audit row has a real owner.

Threat: CWE-798 · CWE-540

Provider API keys leak into source, terminals, CI logs.

A `sk-…` key committed to git, pasted into a Slack channel, exfiltrated from a stale `.env` file, or echoed in a CI step ends up on a paste site within hours. The blast radius is the whole provider account.

Control: Keys live only on the gateway. Clients receive a short-lived, scoped, revocable PAT. Rotate keys without redeploying a single client.

Threat: no-budget runaway

One unbounded loop becomes a six-figure bill overnight.

An agent retries on every error. A batch job runs ten times because the queue lost ack. A dev forgets to cap `max_tokens`. The provider charges by the token, not the intent.

Control: Hard ceilings per team, project, and model. Soft alerts at 80%. Block at 100%. Per-call token bounds enforced at the gateway, not trusted to the client.

Threat: prompt injection

Content the user never wrote becomes the new untrusted input.

RAG pipelines, document summarisation, copilots — all read content the user did not write. Hidden instructions in that content can make a tool-using agent leak data, call dangerous tools, or escape its envelope.

Control: Input and output scanners detect injection markers, jailbreak templates, prompt-leak signatures, and known exploits. Flag, redact, or block — your choice, per route.

Threat: data exfiltration

Sensitive data leaves with the prompt and never comes back.

Customer PII, API secrets, internal IDs, source code — all routinely pasted into prompts. The provider keeps a copy. Your DLP didn't see it because the egress looks like an ordinary HTTPS POST.

Control: Inline PII detectors and secret scanners. Redact before egress, log the redaction, alert on the pattern. Same engine as the MCP scanners — one rule library, both surfaces.

Cost control

From "how much did we spend on AI last quarter?" to "this team is at 78% of budget."

The gateway sits in the path of every token. That makes cost a property of the platform, not a quarterly surprise. Set ceilings, cache smart, route cheap-when-cheap-is-fine, and turn the procurement bill into a dashboard.

  • Hard ceilings per team, project, model. Block at 100%, alert at any threshold.
  • Per-call token bounds enforced at the gateway — agents can't lie about `max_tokens`.
  • Response cache with TTL + cache-key control. Identical prompts pay once.
  • Model right-sizing rules: route cheap models for cheap jobs, escalate when a quality scanner trips.
  • Per-token unit economics attributed to the calling identity — agent, principal, team, or product.
See the LLM cost & key risk deep-dive
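
As a sketch of how a team ceiling might be expressed, in the spirit of the route policy shown under Model routing below (the key names here are illustrative, not the shipped schema):

# budgets.yaml: illustrative shape only; key names are assumptions
team:                 customer-ops
budget_cap_usd:       20000          # hard ceiling; requests block at 100%
alert_thresholds:     [0.5, 0.8]     # soft alerts at 50% and 80% of the cap
max_tokens_per_call:  4096           # enforced at the gateway, whatever the client sends
cache:                true
cache_ttl:            15m
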
Spend by team · MTD May 1 – 8

  payments-platform   $7,820 / $20k    within budget
  customer-ops        $13,402 / $20k   within budget
  risk-platform       $10,910 / $12k   over 80% — alert sent
  internal-search     $4,810 / $20k    within budget
  copilots-eng        $10,250 / $20k   within budget
  research            $1,948 / $15k    within budget

  3 alerts · 0 blocks
  Cache hit rate: 38% (last 24h · ~$2,140 saved)
  Avg model size: ↓ 41% vs. policy baseline

Key custody

Provider keys never leave the gateway.

The single biggest cause of an LLM-cost incident is a leaked key, full stop. AI Warden treats provider credentials like database credentials: they live in one place, they are rotated routinely, and no human or client ever sees them.

  • Server-side credential vault with envelope encryption. Backed by AWS KMS, Azure Key Vault, or HashiCorp Vault.
  • Short-lived PATs for clients — scoped to one team, one set of models, one budget. Revocable in one click.
  • OIDC client credentials (M2M) for system principals. No long-lived secrets in CI.
  • Rotation without redeploy — swap a provider key on the gateway, and every client keeps working.

before

# client.py — committed last week
import openai
openai.api_key = "sk-prod-7f2a-…"
openai.chat.completions.create(...)

A real key, in clear text, in source. Rotation means hunting every repo, every CI runner, every laptop.

after

# client.py — same shape, no provider key
import os
import openai

openai.base_url = "https://gw.aiwarden.io/v1"
openai.api_key  = os.environ["AIW_PAT"]
openai.chat.completions.create(...)

A scoped, expiring PAT. Mint a new one in seconds. Provider keys never leave the platform.
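
For CI and other system principals, the same swap works with OIDC client credentials instead of a personal token: the runner exchanges a client ID and secret for a short-lived access token and presents it to the gateway. A minimal sketch (the token endpoint, realm, and variable names below are illustrative, not shipped defaults):

# ci_client.py: illustrative machine-to-machine flow; endpoint and names are assumptions
import os
import requests
import openai

# Exchange client credentials for a short-lived access token (standard OAuth2 flow).
# The issuer URL and realm are placeholders; use your IdP's token endpoint.
token = requests.post(
    "https://sso.example.com/realms/platform/protocol/openid-connect/token",
    data={
        "grant_type": "client_credentials",
        "client_id": os.environ["AIW_CLIENT_ID"],
        "client_secret": os.environ["AIW_CLIENT_SECRET"],
    },
    timeout=10,
).json()["access_token"]

# Present the token to the gateway exactly as a PAT would be.
openai.base_url = "https://gw.aiwarden.io/v1"
openai.api_key = token
openai.chat.completions.create(...)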

Prompt firewall

Treat untrusted text like untrusted input. Because it is.

Two scanner pipelines run on every gateway hop — one on the request, one on the response. Configure each rule to flag, redact, or block. The same engine runs on MCP requests, so you write a rule once and apply it everywhere.

Prompt injection & jailbreaks

Detect known instruction-override patterns, role-confusion templates, and indirect-injection markers in tool inputs and RAG context.

  • Direct injection ("ignore previous…")
  • Indirect injection in retrieved docs
  • Function-call coercion attempts

Secrets & tokens

Provider keys, AWS access keys, GitHub PATs, GCP service-account JSON, JWTs, and 70+ other patterns. Redact before they leave your network.

  • Format + entropy heuristics
  • Issuer-specific signatures
  • Per-rule action: flag · redact · block

PII & regulated data

Names, emails, phone numbers, national IDs, payment card numbers. Locale-aware (UK, EU, US, India, Singapore). Redact in-flight or block by classification.

  • Per-locale rule sets
  • Per-route & per-team policy
  • Audit row records the redaction, not the data

Anomaly & behaviour

Volume, latency, model-mix, and tool-call shape baselines per principal. Page when an agent strays from its envelope.

Custom rules

Bring your own regex, your own classifier, or your own LLM-judge rule. Hot-reload without restarting the gateway. Per-rule action and per-rule blast-radius preview before you ship.
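
A sketch of what a bring-your-own rule could look like, in the same spirit as the route policy under Model routing (the rule fields here are illustrative, not the shipped schema):

# custom-rule.yaml: illustrative shape only; field names are assumptions
rule:        internal-ticket-id
match:
  regex:     "PROJ-[0-9]{4,6}"          # bring your own regex
applies_to:  [request, response]        # same rule on both scanner pipelines
action:      redact                     # flag · redact · block
routes:      [customer-ops, copilots-eng]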

Output guardrails

Same scanner library, applied to model output. Stop the assistant from quoting back the secret it found, or from emitting content that violates your policy.

Model routing

The right model for the job. Decided by policy, not by client config.

OpenAI-compatible in, any model out. Route by team, by route, by content classification, by cost ceiling, or by quality. Fail over between providers without rewriting clients.

  • Provider-agnostic. OpenAI · Anthropic · Azure OpenAI · AWS Bedrock · Google Vertex · self-hosted.
  • Cost-aware fallback. Try a small model first, escalate on a quality scanner, log both attempts.
  • Failure-aware fallback. Provider 5xx? Drain to the next with a circuit-breaker.
  • Region pinning. Route EU traffic to EU regions only. Same for UK, US, Asia.
policy.yaml · route: customer-ops · escalate-on-quality
# Try a small, cheap model first.
primary:   openai:gpt-4o-mini       # $0.15 / 1M in
fallback:  anthropic:haiku-3-5      # on 5xx or quota

# Escalate when the quality scanner is unhappy.
escalate_to: openai:gpt-4o
escalate_if: scanner.quality < 0.7
escalate_max: 1

region_pinning: eu-only
budget_team:    customer-ops
budget_cap_usd: 20000
redact_pii:     true
cache:          true
cache_ttl:      15m

Observability

Every request, by every principal, for every model. Forever.

Each request lands in the immutable audit log and the ClickHouse request log within milliseconds. Stream a copy to your SIEM, your data lake, or your existing observability stack — same structured payload, no proprietary format.

  Median (p50) overhead: ~8 ms added gateway latency
  Tail (p99) overhead: ~22 ms (scanners + audit fan-out)
  Audit retention: signed, immutable, exportable
  SIEM sinks: 5+ (Splunk · Sentinel · Elastic · S3 · Kafka)

Integrate

Drop in. Don't rewrite.

  1. Set the base URL.

    Point your existing OpenAI / Anthropic / Bedrock SDK at the gateway. The wire shape is identical to the upstream provider — every existing client (curl, Python SDK, LangChain, OpenAI Node, Vercel AI SDK) keeps working.

  2. Mint a PAT.

    Personal access token from the portal, OIDC client credentials for service principals. Scope to a team, a set of models, a budget. One click to revoke.

  3. Pick a policy.

    Start with a default — observe, scan, log, no enforcement. Promote rules to flag, then to block with blast-radius preview before you ship.

  4. Watch the request log.

    Every prompt, every completion, every scanner verdict, every cost line. Filter by principal, model, team, route, time, verdict — all in ClickHouse, all in the portal.
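
Putting steps 1 and 2 together, a first request through the gateway can look like the sketch below (the model name and environment variable are illustrative):

# first_request.py: steps 1 and 2 end to end; model and env var are illustrative
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://gw.aiwarden.io/v1",   # step 1: point the SDK at the gateway
    api_key=os.environ["AIW_PAT"],          # step 2: a scoped, short-lived PAT
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "One line: what does this gateway do?"}],
)
print(resp.choices[0].message.content)      # step 4: the call now appears in the request log

Nothing about the call changes when you later promote scanner rules from observe to block; that happens in policy (step 3), not in client code.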

Specification

At a glance

Area | What it does | Configurable per
Wire compatibility | OpenAI & Anthropic & Bedrock-style payloads accepted; routes to any provider. | route
Auth | PAT (short-lived) · OIDC client credentials · mTLS. SSO into Keycloak / Entra / Okta. | principal
Cost | Hard ceilings; soft alerts at any threshold; per-call token bounds; cache. | team · project · model
Scanners | Prompt-injection · secrets · PII · custom rules · quality · output guardrails. | route · per-rule action
Routing | Provider · region · model · cost-aware fallback · failure fallback · circuit breaker. | route
Observability | Immutable audit log; ClickHouse request log; OTel; SIEM fan-out. | tenant
Identity for agents | System principals; per-agent budget; per-agent scope; behavioural baselines. | agent
Deployment | Self-hosted single binary (Go) or managed SaaS. Same code path either way. | tenant
Latency overhead | p50 ~8 ms · p99 ~22 ms (scanners + audit fan-out, EU region).

Next step

Send your first prompt through it.

A 45-minute working session. Your provider, your network, your model. Leave with a working gateway, a PAT in CI, and a real policy.