Percival Labs
Defensive Disclosure PL-DD-2026-003

Per-Agent Policy Enforcement and Budget Management at the AI Inference Proxy Layer

Alan Carroll · March 5, 2026

Defensive Disclosure. This document is published to establish prior art under 35 U.S.C. 102(a)(1) and prevent the patenting of the described methods by any party. The protocol-level concepts are dedicated to the public domain. Specific implementations, trade secrets, and trademarks are retained by Percival Labs.

Abstract

This disclosure describes a system and method for governing AI agent inference access through a proxy layer that enforces per-agent model allowlists, budget caps with configurable reset periods, and agent self-service introspection APIs—all without requiring modifications to the agents themselves.

The system enables platform operators to govern heterogeneous fleets of AI agents through a single configuration surface, shifting governance enforcement from distributed application code to a centralized infrastructure layer positioned between agents and upstream model providers.

1. The Problem

Organizations deploying multiple AI agents face a compound governance challenge spanning three domains: multi-agent policy enforcement, budget control, and agent operational awareness. As of March 2026, no standardized mechanism exists to address these at the infrastructure layer.

1.1 The Multi-Agent Governance Problem

Each agent may require different model access, different spending limits, and different levels of operational visibility. Current approaches embed governance logic within each agent’s application code, creating compounding problems:

Distributed Enforcement Is Unreliable

Each agent must correctly implement policy checks. A bug in one agent can result in unauthorized model access or uncontrolled spending. There is no single enforcement point that guarantees compliance across all agents.

Configuration Drift

Policy changes require updating and redeploying each agent individually. Organizations with dozens or hundreds of agents face operational complexity proportional to fleet size.

No Separation of Concerns

Agent developers must understand and implement governance logic alongside their domain logic. This conflates two distinct responsibilities and increases the surface area for errors.

Limited Visibility

Without centralized enforcement, there is no unified view of which agents are consuming which resources at what cost. Audit trails must be aggregated from individual agent logs.

1.2 The Budget Enforcement Problem

AI inference APIs charge per token, with costs varying by orders of magnitude across models—from $0.25/million tokens for small models to $75/million tokens for frontier reasoning models. An agent with unrestricted access can accumulate significant costs through:

Model Selection Errors

An agent configured to use a $3/million-token model may inadvertently route to a $75/million-token model through a configuration error or prompt injection.

Runaway Loops

An agent caught in a tool-use retry loop can generate thousands of requests, each incurring cost.

Prompt Inflation

Increasingly large context windows (up to 200K tokens per request) mean a single malformed request can cost hundreds of dollars.

Existing cloud provider billing operates at the account level with monthly invoices. There is no mechanism to enforce spending limits per-agent, per-period, at the point of inference. Organizations discover overspend after the fact, not before the request is made.

1.3 The Agent Self-Awareness Problem

AI agents operating autonomously benefit from awareness of their own operational constraints. An agent that knows it has consumed 80% of its budget can proactively switch to cheaper models or defer non-urgent tasks. An agent that knows which models are available to it can select appropriately without trial-and-error requests that fail with authorization errors. No existing inference infrastructure provides agents with self-service APIs for querying their own policies, budgets, and usage.

2. The Solution: Proxy Pipeline

The system operates as a proxy layer—implemented as a serverless function, edge worker, or reverse proxy—positioned between AI agents and upstream model provider APIs. The proxy intercepts all inference requests and applies the following pipeline:

Agent Request
→ Authentication (verify agent identity)
→ Agent Self-Service API (if /agent/* path, return introspection data)
→ Rate Limiting (per-identity, tier-based)
→ Body Parsing (extract model from request)
→ Auto-Route Resolution (resolve provider from model name)
→ Model Policy Check (verify model in agent’s allowlist)
→ Budget Pre-Check (reject if budget exhausted)
→ Forward to Upstream Provider
→ Extract Token Counts from Response
→ Compute Cost (using pricing table)
→ Record Budget Spend (async)
→ Report Usage (async)
→ Anomaly Detection (async)
→ Emit Audit Log
→ Return Response with Governance Headers
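The pre-forward portion of this pipeline can be sketched as an ordered list of short-circuiting checks. This is an illustrative sketch, not the actual implementation: the stage names, the context type, and the simplified exact-match model check are assumptions (the disclosure's real model check uses bare-name matching, described in Section 4).

```typescript
// Illustrative pipeline sketch: each governance stage runs in order, and the
// first failing stage rejects the request before any upstream call is made.
type Stage = (ctx: Ctx) => { pass: boolean; status?: number; reason?: string };

interface Ctx {
  model: string;
  allowedModels: string[]; // empty = all models permitted
  spentSats: number;
  maxSats: number;
  rateRemaining: number;
}

const stages: [string, Stage][] = [
  ["rate-limit", (c) => ({ pass: c.rateRemaining > 0, status: 429, reason: "rate-limited" })],
  // Simplified exact match for brevity; see Section 4 for bare-name matching.
  ["model-policy", (c) => ({
    pass: c.allowedModels.length === 0 || c.allowedModels.includes(c.model),
    status: 403,
    reason: "model-blocked",
  })],
  ["budget-pre-check", (c) => ({ pass: c.spentSats < c.maxSats, status: 402, reason: "budget-exceeded" })],
];

// Returns whether to forward upstream, plus the rejection status if not.
function runPipeline(ctx: Ctx): { forward: boolean; status: number; reason?: string } {
  for (const [, stage] of stages) {
    const v = stage(ctx);
    if (!v.pass) return { forward: false, status: v.status!, reason: v.reason };
  }
  return { forward: true, status: 200 };
}
```

The ordering matters: cheap checks (rate limit, policy) run before the budget read, and all of them run before any cost is incurred upstream.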

3. Per-Agent Configuration

Each agent is identified by a long-lived authentication token (e.g., a 256-bit random hex string). The token maps to a configuration record in a key-value store containing:

pubkey: Cryptographic public key for the agent, enabling cross-system identity
agentId: Human-readable identifier
name: Display name
tier: Trust tier override (e.g., "standard", "elevated", "unlimited")
models: Array of permitted model identifiers (empty array = all models permitted)
defaultModel: Model to inject when request doesn't specify one
budget: Object containing maxSats (maximum spend per period) and periodDays (reset interval)

This single record is the complete governance configuration for one agent. Fleet management reduces to CRUD operations on these records. The configuration scales from a solo developer with one agent to an enterprise with thousands of agents without architectural changes.
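A sketch of such a record as a typed structure follows. The field names come from the list above; the concrete values (key, identifiers, model names, budget figures) are hypothetical placeholders.

```typescript
// Per-agent configuration record, keyed by the agent's auth token in a KV store.
interface AgentRecord {
  pubkey: string;       // cryptographic public key for cross-system identity
  agentId: string;      // human-readable identifier
  name: string;         // display name
  tier: "standard" | "elevated" | "unlimited"; // trust tier override
  models: string[];     // permitted models; empty array = all models permitted
  defaultModel: string; // injected when a request omits a model
  budget: { maxSats: number; periodDays: number };
}

// Hypothetical example record; all values are made up for illustration.
const exampleAgent: AgentRecord = {
  pubkey: "ed25519:placeholder",
  agentId: "support-bot-01",
  name: "Support Bot",
  tier: "standard",
  models: ["claude-sonnet-4", "claude-haiku-4"],
  defaultModel: "claude-haiku-4",
  budget: { maxSats: 10_000, periodDays: 30 },
};
```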

4. Model Allowlist Matching

Model identifiers in AI inference APIs use two conventions: bare names (e.g., “claude-sonnet-4”) and provider-prefixed names (e.g., “anthropic/claude-sonnet-4”). The proxy’s allowlist matching handles both by comparing the bare portion of the requested model against the bare portion of each allowed model.

This means an allowlist entry of “claude-sonnet-4” permits requests for both “claude-sonnet-4” and “anthropic/claude-sonnet-4”, preventing policy circumvention through name format variation.
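A minimal sketch of this matching rule, assuming the bare portion is everything after the last "/" (the function name is illustrative):

```typescript
// Compare the bare portion of the requested model against the bare portion
// of each allowlist entry, so "claude-sonnet-4" and "anthropic/claude-sonnet-4"
// are treated as the same model.
function modelAllowed(requested: string, allowlist: string[]): boolean {
  if (allowlist.length === 0) return true; // empty allowlist = all models permitted
  const bare = (m: string) => m.split("/").pop() ?? m;
  return allowlist.some((allowed) => bare(allowed) === bare(requested));
}

// modelAllowed("anthropic/claude-sonnet-4", ["claude-sonnet-4"])  → true
// modelAllowed("claude-sonnet-4", ["anthropic/claude-sonnet-4"])  → true
// modelAllowed("gpt-4o", ["claude-sonnet-4"])                     → false
```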

When an inference request arrives without a model field in the request body, the proxy injects the agent’s configured default model before forwarding to the upstream provider, enabling agents to operate without hardcoded model names.

5. Two-Phase Budget Enforcement

Budget state per agent is stored as three values: cumulative spend in the current period, timestamp when the current period began, and timestamp of the most recent spend recording. On each request, the proxy executes:

Phase 1: Pre-Check

Before forwarding the request to the upstream provider, the proxy reads current budget state, checks if the period has elapsed (resetting to zero if so), and rejects with HTTP 402 if cumulative spend has reached the configured maximum. This avoids unnecessary upstream API costs.

Phase 2: Post-Response Recording

After receiving the upstream response, the proxy computes actual cost from the response's token usage data and pricing table, then writes the updated spend back to the key-value store asynchronously to avoid adding latency to the response path.
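The two phases can be sketched as pure functions over the three stored values. The state shape and function names are assumptions for illustration; in the real system the state lives in a key-value store and Phase 2's write-back is asynchronous.

```typescript
// The three values stored per agent, per the disclosure.
interface BudgetState {
  spentSats: number;     // cumulative spend in the current period
  periodStartMs: number; // when the current period began
  lastSpendMs: number;   // most recent spend recording
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Phase 1: pre-check. Resets the period if it has elapsed, then decides
// whether the request may be forwarded upstream (402 if not).
function preCheck(
  state: BudgetState, maxSats: number, periodDays: number, nowMs: number
): { allowed: boolean; state: BudgetState } {
  let s = { ...state };
  if (nowMs - s.periodStartMs >= periodDays * DAY_MS) {
    s = { spentSats: 0, periodStartMs: nowMs, lastSpendMs: s.lastSpendMs };
  }
  return { allowed: s.spentSats < maxSats, state: s };
}

// Phase 2: post-response recording, using actual cost computed from the
// upstream response's token counts and the pricing table.
function recordSpend(state: BudgetState, costSats: number, nowMs: number): BudgetState {
  return { ...state, spentSats: state.spentSats + costSats, lastSpendMs: nowMs };
}
```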

Concurrency Model

Budget tracking uses a read-then-write pattern against a key-value store (not a transactional database). Concurrent requests from the same agent may result in slight overspend. This mirrors the soft-limit model used by cloud infrastructure providers—an acceptable trade-off for the performance benefit of key-value store latency versus database transactions.

For organizations requiring exact budget enforcement, the system can be extended with serialized access via durable objects or distributed locks, at the cost of increased latency. Key-value store entries are configured with TTL equal to the remaining period plus a buffer, ensuring automatic cleanup without manual garbage collection.

6. Agent Self-Service APIs

The proxy exposes authenticated endpoints that agents can call to query their own operational state, using the same token they use for inference:

GET /agent/v1/me: full configuration (models, tier, budget parameters)
GET /agent/v1/me/budget: current spend, remaining amount, percent utilized, period boundaries, and actionable warnings
GET /agent/v1/me/usage: 24-hour usage statistics (request counts per hour, models used, average prompt length)
GET /agent/v1/models: available models and routing guidance

These endpoints do not count against the agent’s rate limit, enabling agents to check their status without consuming inference quota. Budget warnings are actionable (e.g., “Budget 80% used—2,000 sats remaining”).
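As one sketch of agent-side use, a budget response can drive model selection directly. The response shape and the 80% threshold here are assumptions; only the endpoint itself comes from the disclosure.

```typescript
// Assumed subset of the GET /agent/v1/me/budget response.
interface BudgetStatus {
  spentSats: number;
  maxSats: number;
  percentUsed: number;
}

// Past the threshold, fall back to a cheaper model instead of exhausting
// the remaining budget on the preferred one.
function pickModel(status: BudgetStatus, preferred: string, cheap: string, thresholdPct = 80): string {
  return status.percentUsed >= thresholdPct ? cheap : preferred;
}
```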

7. Governance Response Headers

Every inference response includes headers communicating governance state to the agent, enabling informed decisions about subsequent requests without requiring a separate API call:

X-Vouch-Tier: Current trust tier
X-Vouch-Rate-Remaining: Remaining requests in rate limit window
X-Vouch-Model: Model that was actually used
X-Vouch-Provider: Upstream provider that handled the request
X-Vouch-Cost-Sats: Estimated cost of this request
X-Vouch-Budget-Max: Agent's total budget cap
X-Vouch-Budget-Cost: Cost charged against budget for this request
X-Vouch-Input-Tokens / X-Vouch-Output-Tokens: Token counts from the upstream response
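An agent can fold these headers into a decision-ready summary without a separate API call. The header names come from the list above; the summary shape and defaults are illustrative.

```typescript
// Extract governance state from a response's headers (modeled here as a Map).
function governanceSummary(headers: Map<string, string>) {
  const num = (k: string) => Number(headers.get(k) ?? "0");
  return {
    tier: headers.get("X-Vouch-Tier") ?? "unknown",
    model: headers.get("X-Vouch-Model") ?? "unknown",
    costSats: num("X-Vouch-Cost-Sats"),
    budgetMax: num("X-Vouch-Budget-Max"),
    rateRemaining: num("X-Vouch-Rate-Remaining"),
  };
}
```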

8. Structured Audit Logging

The proxy emits machine-parseable structured log entries for every request at every governance decision point. Each entry records:

Action Type: inference, rate-limited, budget-exceeded, model-blocked, auth-failed
Identity: authenticated agent identity (truncated for privacy)
Model & Provider: requested model and upstream provider that handled it
Cost Data: token counts, estimated cost, HTTP status code
Governance Context: trust tier and request duration; no request or response bodies (preserving prompt privacy)

This creates a complete audit trail suitable for compliance reporting without logging prompt content.

9. Trust-Tier Integration

Agent identities are associated with both a configured trust tier (determining rate limits) and a trust score from an external scoring system. The proxy uses the higher of the two—configured tier or score-derived tier—enabling agents to “earn” higher rate limits through demonstrated trustworthiness while maintaining a minimum floor configured by the platform operator.
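The "higher of the two" rule reduces to an ordering over tiers. The tier names below are the examples given in Section 3; the ordering itself is an assumption for illustration.

```typescript
// Tiers in ascending order of trust.
const TIER_ORDER = ["standard", "elevated", "unlimited"] as const;
type Tier = (typeof TIER_ORDER)[number];

// Effective tier = max(configured floor, score-derived tier), so an agent can
// earn a higher tier but never drop below its configured minimum.
function effectiveTier(configured: Tier, earned: Tier): Tier {
  return TIER_ORDER.indexOf(earned) > TIER_ORDER.indexOf(configured) ? earned : configured;
}
```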

Platform administration APIs, separately authenticated from agent credentials using a platform-level secret, enable programmatic fleet management: creating, reading, updating, and deleting agent identity configurations without direct access to the underlying key-value store.

10. Novel Contributions

The following aspects are believed to be novel as of the filing date:

  1. Enforcement of per-agent model access policies at an inference proxy layer rather than within agent application code, using configurable allowlists with bare/prefixed model name matching
  2. Per-agent budget caps with configurable reset periods enforced at the point of inference, with two-phase enforcement (pre-check before upstream call, spend recording after)
  3. Agent self-service introspection APIs co-located with the inference proxy, enabling agents to query their own governance state through the same endpoint they use for inference
  4. A universal agent configuration surface wherein all governance parameters for an agent are stored as a single serialized record, scaling from single-agent to enterprise-fleet without architectural changes
  5. Structured audit logging at every governance decision point in the inference proxy pipeline, recording governance metadata without prompt content
  6. Governance state communicated via response headers on every inference response, enabling agents to adapt behavior without separate API calls
  7. Platform administration APIs for programmatic fleet management of agent governance configurations, separately authenticated from agent credentials
  8. Integration of trust-tier systems with per-agent policy enforcement, using the higher of configured and earned trust levels

11. Prior Art Established

Feb 23, 2026 · Defensive Disclosure PL-DD-2026-001: Economic Trust Staking for AI Model Inference APIs (establishes staking, vouching, and slashing primitives)
Feb 24, 2026 · Defensive Disclosure PL-DD-2026-002: Economic Accountability Layer for AI Agent Tool-Use Protocol Governance (establishes overlay governance for tool-use protocols)
Mar 4, 2026 · Vouch Gateway v0.2.0 deployed with NIP-98 auth, trust-tiered rate limiting, blind signature privacy tokens, and anomaly detection (3 providers: Anthropic, OpenAI, OpenRouter)
Feb 22, 2026 · Vouch Agent SDK and API deployed with Nostr identity, NIP-98 auth, and trust scoring
2025–2026 · Continuous git commit history documenting protocol development, including inference proxy governance concepts

Filed as a defensive disclosure by Percival Labs, Bellingham, WA, USA. This document constitutes prior art under 35 U.S.C. 102(a)(1). The described protocol-level concepts are dedicated to the public domain for the purpose of preventing patent claims. All rights to specific implementations, trade secrets, and trademarks are reserved.

Document ID: PL-DD-2026-003 · Contact: [email protected]