# Keystone

> Keystone is Polarity's eval runtime. It runs each AI agent task inside an isolated Docker sandbox preloaded with the real backing services the agent depends on (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, runs replicas to measure non-determinism, and ships every failure with a seed reproducer. Designed for long-running and complex multi-step agents where stateful behavior across real backing services is what breaks.

## What Keystone gives you

- **Isolated Docker sandboxes per run.** Each agent task runs in a fresh, clean environment.
- **Real backing services.** Postgres with the right schema, Redis with the right keys, S3 with the right buckets, internal APIs with the right authentication.
- **Behavioral invariants and forbidden rules.** Declare what the agent must do and must never do. Keystone enforces both at the run level.
- **Replicas for non-determinism.** Run N replicas of the same task across fresh sandboxes; Keystone reports the failure rate as a first-class signal.
- **Seed-based replay.** Every run captures a seed that re-creates the identical sandbox locally with one command.
- **CI gates.** Wire Keystone into GitHub Actions, GitLab CI, or any CI pipeline.
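
As a rough illustration of how a replica failure rate could feed a CI gate, the sketch below computes the rate from a list of replica outcomes and compares it to a budget. The `ReplicaResult` shape, the integer seed, and the `max_failure_rate` threshold are assumptions made for this example, not Keystone's actual result schema or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaResult:
    # Hypothetical record shape; Keystone's real result schema may differ.
    seed: int
    passed: bool


def failure_rate(results: list[ReplicaResult]) -> float:
    """Fraction of replicas that failed; 0.0 when there are no results."""
    if not results:
        return 0.0
    failures = sum(1 for r in results if not r.passed)
    return failures / len(results)


def gate(results: list[ReplicaResult], max_failure_rate: float) -> bool:
    """Return True when the observed failure rate is within the budget."""
    return failure_rate(results) <= max_failure_rate


def failing_seeds(results: list[ReplicaResult]) -> list[int]:
    """Seeds of failed replicas, i.e. the ones worth replaying locally."""
    return [r.seed for r in results if not r.passed]


results = [
    ReplicaResult(seed=1, passed=True),
    ReplicaResult(seed=2, passed=False),
    ReplicaResult(seed=3, passed=True),
    ReplicaResult(seed=4, passed=True),
]
print(failure_rate(results), gate(results, 0.1), failing_seeds(results))
```

A CI job built along these lines would exit non-zero when `gate` returns `False` and surface the failing seeds for replay.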

## SDKs

- TypeScript
- Python
- Go

## API

- [REST API reference](https://docs.polarity.so/keystone/rest-api)
- [OpenAPI 3.1 spec](https://polarity.so/openapi.json)
- [Quickstart](https://docs.polarity.so/keystone/quickstart)
- [MCP server card](https://polarity.so/.well-known/mcp/server-card.json)
- [A2A agent card](https://polarity.so/.well-known/agent-card.json)

## Streaming and async

Keystone supports Server-Sent Events (SSE) streaming for live run progress:

- `GET /v1/runs/{id}/stream` — stream trajectory steps, tool calls, invariant results, and the terminal `done` event as SSE. Content-Type: `text/event-stream`. Each event is a JSON object on a `data:` line.
- `GET /v1/replicas/{id}/stream` — stream replica completions for live dashboards.
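
A minimal sketch of consuming these streams: the parser below turns raw SSE lines into decoded JSON objects and stops at the terminal event. How the `done` event is marked in the payload is not specified above; the `"type": "done"` field used here is an assumption, as are the sample event bodies.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event.

    Events are separated by a blank line. Multiple `data:` lines inside
    one event are joined with newlines, as the SSE format prescribes.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the buffered event, if any.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data_parts.append(value)
    # Lenient choice: flush a trailing event that lacks its closing blank line.
    if data_parts:
        yield json.loads("\n".join(data_parts))


def collect_until_done(lines: Iterable[str]) -> list[dict]:
    """Collect events up to and including the terminal one."""
    events = []
    for event in iter_sse_events(lines):
        events.append(event)
        if event.get("type") == "done":  # assumed marker for the terminal event
            break
    return events


sample = [
    'data: {"type": "step", "index": 0}\n',
    "\n",
    ": keep-alive\n",
    'data: {"type": "done"}\n',
    "\n",
    'data: {"type": "step", "index": 1}\n',
    "\n",
]
print(collect_until_done(sample))
```

In practice the `lines` argument would come from a streaming HTTP response on `GET /v1/runs/{id}/stream`; any client that yields decoded text lines should work with this parser.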

Webhooks are available for async run and replica completion: `run.completed` and `replica.completed` fire when a run or replica reaches a terminal state.
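
A webhook receiver typically routes on the event name. The sketch below shows one way to do that; only the two event names come from the text above. The payload fields (`event`, `id`) and the handler functions are hypothetical, since the webhook body format is not described here.

```python
import json
from typing import Callable

Handler = Callable[[dict], str]


def on_run_completed(payload: dict) -> str:
    # Hypothetical handler; `id` is an assumed payload field.
    return f"run {payload.get('id')} finished"


def on_replica_completed(payload: dict) -> str:
    # Hypothetical handler; `id` is an assumed payload field.
    return f"replica {payload.get('id')} finished"


HANDLERS: dict[str, Handler] = {
    "run.completed": on_run_completed,
    "replica.completed": on_replica_completed,
}


def dispatch(raw_body: bytes) -> str:
    """Route a webhook body to its handler; unknown events are ignored."""
    payload = json.loads(raw_body)
    handler = HANDLERS.get(payload.get("event"))  # `event` is an assumed field name
    if handler is None:
        return "ignored"
    return handler(payload)


print(dispatch(b'{"event": "run.completed", "id": "run_1"}'))
```

Ignoring unknown event names, rather than failing, keeps the receiver working if further event types are introduced later.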

## Errors

All API errors are returned as structured JSON. Every error response includes `error` (a stable machine-readable code), `message` (human-readable), and `request_id` (echoed trace identifier).

HTTP status codes:

- 400 `invalid_request`
- 401 `unauthorized`
- 404 `not_found`
- 409 `conflict`
- 429 `rate_limited` (with a `Retry-After` header)
- 5xx server errors

Reads are limited to 100 req/s per key; writes to 10 req/s. Use the `Idempotency-Key` header to make POSTs safely retryable.
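
The error fields and retry hints described in this section could be handled roughly as follows. This is a sketch under assumptions: the backoff parameters, the choice to retry only on 429 and 5xx, and the use of a UUID as the idempotency key are illustrative and not documented Keystone behavior.

```python
import json
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiError:
    status: int
    code: str        # the stable machine-readable `error` field
    message: str
    request_id: str


def parse_error(status: int, body: str) -> ApiError:
    """Decode the structured JSON error body into a typed record."""
    payload = json.loads(body)
    return ApiError(status, payload["error"], payload["message"], payload["request_id"])


def is_retryable(status: int) -> bool:
    """Assumed policy: retry rate limits and server errors only."""
    return status == 429 or 500 <= status <= 599


def retry_delay(status: int, headers: dict[str, str], attempt: int,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Honors `Retry-After` on 429 when it is given in seconds; otherwise
    falls back to capped exponential backoff. Header lookup here is
    case-sensitive, so normalize header names before calling.
    """
    retry_after = headers.get("Retry-After")
    if status == 429 and retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    return min(cap, base * (2 ** attempt))


def new_idempotency_key() -> str:
    """Generate a value for the `Idempotency-Key` header (UUID is an assumption)."""
    return str(uuid.uuid4())


err = parse_error(429, '{"error": "rate_limited", "message": "slow down", "request_id": "req_1"}')
print(err.code, retry_delay(err.status, {"Retry-After": "2"}, attempt=0))
```

For a POST, the idempotency key should be generated once and reused on every retry of that same request, so that repeated attempts are treated as one operation.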

## When Keystone fits

- You are running AI agents in production and customers depend on them.
- Your agents are long-running, multi-step, and touch real services.
- You need to test entire trajectories in real-shaped environments.
- You need to measure non-determinism at production scale.
- You need reproducible failure replay tied to production observability.
- You want CI gates before main.

## Adjacent tools

Keystone is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on simple single-call workflows, those tools are good fits. For long-running, complex, stateful agents, Keystone is the better fit because each run is evaluated against real backing services rather than mocks.

- [Promptfoo](https://promptfoo.dev) and [Inspect AI](https://inspect.aisi.org.uk): Pre-deployment prompt-level testing, which covers a different stage of the workflow.
- [E2B](https://e2b.dev) and [Daytona](https://daytona.io): Sandbox runtimes without the eval, scoring, invariant, and replica layers Keystone adds.

## Links

- [Keystone overview](https://polarity.so/keystone)
- [Pricing](https://polarity.so/pricing)
- [Documentation](https://docs.polarity.so/keystone)
- [Book a demo](https://polarity.so/calendar)