Blackbox Agent Harness

Observe, enforce, and explain every AI agent run.

Most observability tools just inspect model calls. Blackbox is not another observability dashboard; it is a runtime control harness for AI agents that spend money, call tools, and take actions.

Observability tells you what happened. Blackbox helps answer: should this agent be allowed to do this right now, can we afford it, is this tool approved, are we actually seeing the full run, and what should be blocked next time?

The end-to-end control loop

Each step below is a real stage in the Blackbox harness.

Stage groups: Runtime, Enforcement, Coverage, Intelligence, Outcome.

1. Connect

Register the agent

Create the agent profile: owner, workflow, department, model provider, approved tools, runtime source, and Blackbox key.

Blackbox knows which agent is running, who owns it, and which integration path it uses.
2. Choose Path

Pick how telemetry enters

Use the SDK wrapper, LLM gateway, Cloud Run/log source, push drain, OTLP ingest, MCP proxy, or active probe endpoint.

The harness is not only the SDK; it has multiple entry points for different customer runtimes.
3. Policy

Pull live controls

The runtime fetches current spend caps, model allowlists, tool policies, blocked tools, connector budgets, and parameter caps.

Controls can change from the dashboard without redeploying the agent.
4. Runtime Gate

Block before execution

Before LLMs or tools run, Blackbox checks budgets, rates, model choice, blocked tools, schema rules, connector limits, and duration caps.

Bad or expensive work can be stopped before it burns money.
5. Capture

Record the whole run

Model calls, tool calls, MCP, scripts, browser actions, DB/storage, notifications, infra cost, reasoning, and errors become ordered steps.

The trace shows what the agent actually did, not just one model request.
6. Ingest

Normalize into Blackbox

Events stream into ingest APIs and become runs, steps, costs, tool costs, gateway requests, logs, alerts, audit rows, and trip logs.

Raw runtime events turn into structured operational data.
7. Prove

Verify coverage and access

Heartbeats report hooks, wrapped boundaries, strict mode, unwrapped calls, SDK version, policy version, and coverage score.

Teams can see what is covered, what is missing, and whether live policy is active.
8. Audit

Probe deployment risk

Cloud Run verification, log-reader checks, static coverage scans, active HTTP probes, and known-failure scans catch blind spots.

Security review becomes an explicit harness step, not an afterthought.
9. Diagnose

Explain failures and spend

Error classifier, registry, loop detection, empty-result alerts, cost attribution, gateway guardrails, and FinOps views identify the root cause.

Teams know why the agent failed, overspent, looped, or called the wrong thing.
10. Manage

Control from the dashboard

Operators review traces, costs, guardrails, registry, alerts, logs, probes, policies, and integrations, then adjust controls.

Blackbox becomes the management loop for production agents.

Observe

Reconstructs the agent run from ordered steps instead of trying to infer behavior from generic server logs.

Enforce

Blocks over-budget, unauthorized, mis-scoped, or unknown tool/model calls before execution.

Prove

Shows which boundaries are wrapped, which hooks are installed, and where instrumentation gaps remain.

Explain

Turns raw failures into known patterns with severity, category, provider, cost impact, and suggested fix.

Proof from the actual modules

These are the concrete code paths behind the harness.

SDK runtime

`BlackboxSession` creates the agent boundary

Initializes run identity, local guardrails, data policy, error registry, schema validation, model switching, fetch wrapping, and live policy refresh.

Pre-call enforcement

`before()` blocks bad work before execution

Runs model allowlist, rate limit, budget, tool, and auto-switch checks before the LLM or tool call happens.

Async telemetry

`BlackboxClient` sends events without blocking the agent

Buffers events, flushes to `/api/ingest/stream`, retries failures, and never includes provider API keys.

Coverage scanner

`scanCoverageSource()` finds missing instrumentation

Detects raw vendor calls, raw LLM clients, and missing SDK initialization so teams can fix blind spots before production.
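
As a rough illustration of the idea, a static scan can be sketched as a line-by-line pattern match. The patterns, finding names, and the function `scanSourceSketch` below are hypothetical stand-ins and are not the rules used by the real `scanCoverageSource()`.

```typescript
// Hypothetical sketch of a static coverage scan. The regexes and finding
// kinds are illustrative assumptions, not Blackbox's actual rule set.
type Finding = {
  kind: "raw_vendor_call" | "raw_llm_client" | "missing_sdk_init";
  line: number; // 1-based source line, or 0 for file-level findings
};

const RAW_VENDOR = /https?:\/\/api\.(firecrawl\.dev|apify\.com|exa\.ai)/;
const RAW_LLM = /new\s+(OpenAI|Anthropic)\s*\(/;
const SDK_INIT = /\bBlackboxSession\b/;

function scanSourceSketch(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, i) => {
    // A direct vendor URL suggests a tool call that bypasses the wrapper.
    if (RAW_VENDOR.test(text)) findings.push({ kind: "raw_vendor_call", line: i + 1 });
    // A bare LLM client constructor suggests an unwrapped model call.
    if (RAW_LLM.test(text)) findings.push({ kind: "raw_llm_client", line: i + 1 });
  });
  // File-level check: the session type is never referenced at all.
  if (!SDK_INIT.test(source)) findings.push({ kind: "missing_sdk_init", line: 0 });
  return findings;
}
```

A real scanner would likely parse the syntax tree rather than match lines, but the output shape (a list of findings with locations) is the part that feeds remediation.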

Deployment scanner

`DeploymentScanner` audits a new agent before it runs

Checks an agent manifest against known failure history and can block critical repeated failures before deployment.
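
One way to picture this check is a join between the tools a manifest declares and a list of previously recorded failures. The types, field names, and the `minRepeats` threshold below are assumptions made for illustration; the real `DeploymentScanner` may use a different manifest format and blocking rule.

```typescript
// Hypothetical sketch: manifest and failure-history shapes are assumed.
type KnownFailure = {
  tool: string;
  category: string;
  severity: "low" | "medium" | "high" | "critical";
  occurrences: number;
};
type AgentManifest = { agentId: string; tools: string[] };
type ScanResult = { allowed: boolean; matches: KnownFailure[] };

function scanManifestSketch(
  manifest: AgentManifest,
  history: KnownFailure[],
  minRepeats = 2 // assumed meaning of "repeated"
): ScanResult {
  // Keep only failures recorded against tools this agent declares.
  const matches = history.filter((f) => manifest.tools.indexOf(f.tool) !== -1);
  // Block only when a critical failure has recurred often enough.
  const blocking = matches.some(
    (f) => f.severity === "critical" && f.occurrences >= minRepeats
  );
  return { allowed: !blocking, matches };
}
```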

Cloud Run probe

`/api/runtime/cloudrun/verify` verifies log access

Parses a Cloud Run URL, probes Cloud Logging with Blackbox credentials, and returns a copyable IAM command if permission is missing.

What the customer has to give us

This answers the practical technical question: is it just the SDK, a URL, or cloud access?

SDK path

Customer installs `@blackbox/sdk`, adds `BLACKBOX_API_KEY` and `BLACKBOX_ENDPOINT`, wraps their LLM client, and wraps external tools with `session.tool()`.

  • We do not need their OpenAI, Anthropic, Google, Firecrawl, Exa, or Apify keys.
  • Provider keys stay in their app, environment, or Composio connection.

Cloud Run URL path

Customer pastes their Cloud Run service URL. We parse service name, project number, and region, then verify whether Blackbox can read logs.

  • Expected format: `https://service-project.region.run.app`.
  • If access fails, we generate the exact `gcloud` IAM grant.
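
A minimal sketch of the parsing step, assuming the URL follows the format above with a numeric project segment. The function names and the service account passed in are hypothetical; only the `gcloud projects add-iam-policy-binding` command and the `roles/logging.viewer` role are standard Google Cloud pieces.

```typescript
type CloudRunTarget = { service: string; projectNumber: string; region: string };

// Matches https://SERVICE-PROJECTNUMBER.REGION.run.app; the service name may
// itself contain hyphens, so the project number is the last all-digit segment.
const CLOUD_RUN_URL = /^https:\/\/([a-z][a-z0-9-]*)-(\d+)\.([a-z0-9-]+)\.run\.app\/?$/;

function parseCloudRunUrl(url: string): CloudRunTarget | null {
  const m = CLOUD_RUN_URL.exec(url.trim());
  if (!m) return null;
  return { service: m[1], projectNumber: m[2], region: m[3] };
}

// Builds a copyable grant for read-only log access. The service account is
// supplied by the caller; its real name is not specified in this document.
function loggingViewerGrant(projectNumber: string, serviceAccount: string): string {
  return [
    `gcloud projects add-iam-policy-binding ${projectNumber}`,
    `--member="serviceAccount:${serviceAccount}"`,
    `--role="roles/logging.viewer"`,
  ].join(" ");
}
```

For example, `parseCloudRunUrl("https://my-agent-123456789012.us-central1.run.app")` yields service `my-agent`, project number `123456789012`, and region `us-central1`.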

Push/log drain path

For Vercel, Cloudflare, AWS, GCP sinks, OTLP, or custom runtimes, the customer streams logs/events to a Blackbox ingest URL with an agent key.

  • By default, Blackbox does not reach into their cloud.
  • This is best when the customer already owns a logging pipeline.

Security probe path

For a public HTTP agent endpoint, Blackbox can run active probes against approved sample payloads and record auth, schema, latency, and failure behavior.

  • Used for non-GCP deployments or audit demos.
  • Requires an endpoint and, optionally, a sample payload; no cloud credentials.

The clean answer: customers can start with only the SDK and a Blackbox key. Cloud deployment access is optional and scoped. For Cloud Run we need log-reader permission, not production write access. For push drains we need only an ingest destination. For active probes we need a URL and agreed test payloads.

Security audit flow

Blackbox is designed to prove coverage and surface risk before, during, and after deployment.

Before deploy

Static coverage scan

Scan source for raw vendor endpoints, raw LLM clients, and missing `BlackboxSession` initialization. Output is a findings report with remediation.

At onboarding

Runtime source verification

For Cloud Run: paste the service URL, parse the service metadata, probe the logs, and ask the customer admin for `roles/logging.viewer` only if the probe is denied.

In production

Live coverage heartbeat

SDK reports strict mode, hooks installed, wrapped boundaries, unwrapped calls, coverage score, SDK version, and live policy version.

During execution

Pre-call policy gates

Blocked tools, model allowlists, per-call limits, per-run budgets, connector budgets, rate limits, and duration caps fire before work executes.

After failure

Known failure registry

Raw errors are classified and matched against recurring failures so new agents can inherit lessons from previous production incidents.

Audit output

Replayable operational record

Every run produces a timeline of model calls, tools, scripts, cost, latency, policy state, error class, and source coverage.

What the market already does

Most AI observability tools help teams inspect model calls after they happen. The current market covers important pieces of AI operations, but each category stops short of controlling agent execution while it is happening.

Braintrust-style eval tools

  • Run evals against prompts, outputs, datasets, and regressions.
  • Help teams answer: “Was this AI response good enough?”
  • Do not primarily act as a live execution gate for tool spend, connector budgets, or runtime policy.

Observability platforms

  • Show traces, latency, errors, spans, model calls, and production debugging context.
  • Help teams answer: “What happened after the run executed?”
  • Often stop at visibility instead of proving coverage and blocking unauthorized calls before execution.

AI spend / ROI tools

  • Track budgets, unit economics, capacity, model usage, and business ROI.
  • Help teams answer: “Where did the money go?”
  • Usually do not reconstruct the agent’s step-by-step work path or stop an expensive tool call before it runs.

How Blackbox is different: runtime control

Blackbox is built for the next problem: agents are no longer just generating text; they are doing work. They call tools, scrape websites, query databases, trigger APIs, run scripts, and retry. That creates hidden spend and hidden risk.

It controls before execution

  • Checks model allowlists, per-call limits, per-run budgets, daily budgets, rate limits, and duration caps.
  • Blocks disabled tools and over-budget connectors before the agent spends money.
  • Answers: “Is this agent allowed to do this right now?”

It proves runtime coverage

  • Reports which boundaries are wrapped: agent, LLM, tools, MCP, scripts, browser, database, storage, infra, and notifications.
  • Detects unwrapped vendor calls and can run in strict mode to block unknown outbound calls.
  • Answers: “Are we actually seeing the full agent, or only the easy model call?”

It explains the operational failure

  • Classifies failures such as rate limits, wrong params, auth failures, tool hallucinations, loops, cost overruns, and context overflow.
  • Connects each issue to cost, tool, provider, model, run, policy state, and remediation.
  • Answers: “Why did this fail or overspend, and what do we change?”

So the category is not just observability. The category is agent control. Blackbox wraps the runtime so companies can monitor the full agent run, enforce controls before expensive actions happen, prove coverage, and explain failures.

Concrete control examples

These are not generic promises. They map to guardrails, classifiers, an alert taxonomy, and predictive cost analyzers we have already built and tested.

Budget kill switch

If one run crosses its budget, Blackbox kills the run

  • Example from tests: if a run spends `$5.01` against a `$5.00` cap, `circuit_breaker` fires.
  • Outcome: the next call is blocked with action `kill` before the agent keeps spending.
  • Why it matters: runaway loops fail fast instead of showing up later on the invoice.
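
The rule in this example can be sketched as a single comparison made before each call. The function name and return shape are illustrative assumptions; amounts are held in integer cents to avoid floating-point drift, and the sketch assumes a run that sits exactly at its cap is still allowed.

```typescript
type BudgetDecision =
  | { allowed: true }
  | { allowed: false; guardrail: "circuit_breaker"; action: "kill" };

// Called before the next LLM or tool call with the run's spend so far.
function checkRunBudget(spentCents: number, capCents: number): BudgetDecision {
  if (spentCents > capCents) {
    return { allowed: false, guardrail: "circuit_breaker", action: "kill" };
  }
  return { allowed: true };
}
```

With `spentCents = 501` and `capCents = 500`, the decision is a `kill`, which matches the `$5.01` against `$5.00` case above.
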
Model control

If an agent switches to an unapproved model, Blackbox blocks it

  • Example from tests: allowed models are `gpt-4o` and `gpt-4o-mini`.
  • If the agent tries `claude-opus-4.7`, `model_whitelist` blocks before the call runs.
  • Why it matters: teams prevent surprise frontier-model spend and policy drift.
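
A minimal sketch of the allowlist check, using the model names from the example. The function name and return shape are assumptions for illustration.

```typescript
type ModelDecision = { allowed: boolean; guardrail?: "model_whitelist" };

// Exact-match lookup against the approved model list.
function checkModel(model: string, allowedModels: string[]): ModelDecision {
  if (allowedModels.indexOf(model) !== -1) {
    return { allowed: true };
  }
  return { allowed: false, guardrail: "model_whitelist" };
}
```
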
Tool and connector caps

If Firecrawl or Apify spend exceeds limits, Blackbox stops that path

  • Example from tests: Firecrawl connector budget is `$2.00`; cumulative calls hit `$2.10`.
  • Outcome: `connector_budget` blocks the next Firecrawl step.
  • Example from our alert map: if Apify estimated cost is `$0.07` and budget is `$0.05`, classify as `policy_block` with outcome `blocked`.
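
Both cases can be sketched with a running total per connector plus a pre-call check on the estimate. The type and function names, and the `category` field name, are hypothetical; amounts are in integer cents.

```typescript
type ConnectorState = { budgetCents: number; spentCents: number };

// Add the cost of a completed call to the connector's running total.
function recordConnectorCost(state: ConnectorState, costCents: number): ConnectorState {
  return { budgetCents: state.budgetCents, spentCents: state.spentCents + costCents };
}

// True once cumulative spend has passed the connector budget, so the
// next step on this connector should be blocked.
function connectorBlocked(state: ConnectorState): boolean {
  return state.spentCents > state.budgetCents;
}

type EstimateAlert = { category: "policy_block"; outcome: "blocked" };

// Pre-call check: block when the estimate alone exceeds the budget.
function classifyEstimate(estimateCents: number, budgetCents: number): EstimateAlert | null {
  if (estimateCents > budgetCents) {
    return { category: "policy_block", outcome: "blocked" };
  }
  return null;
}
```

Three Firecrawl calls of 70 cents each against a 200 cent budget leave the connector at 210 cents, so `connectorBlocked` returns true before a fourth call.
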
Cost surprise

If actual spend is much higher than estimated, Blackbox flags it

  • Example from tests: estimated cost is `$0.50`, actual tool cost comes back at `$1.50`.
  • Outcome: `cost_surprise` fires because actual cost is more than `2x` the estimate.
  • Why it matters: the agent may be calling a larger dataset, wrong actor, or expanded tool scope.
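
The trigger described here reduces to one comparison. The function name and the default `multiple` parameter are illustrative; the `2x` threshold comes from the example above.

```typescript
// Fires when the actual cost is strictly more than `multiple` times the
// estimate. A zero or negative estimate is treated as "no estimate".
function isCostSurprise(estimatedUsd: number, actualUsd: number, multiple = 2): boolean {
  return estimatedUsd > 0 && actualUsd > estimatedUsd * multiple;
}
```

For the example values, `isCostSurprise(0.5, 1.5)` is true because `$1.50` exceeds `2 x $0.50 = $1.00`.
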
Predictive spend

If today’s run rate predicts a huge bill, Blackbox warns before damage accumulates

  • Example from our analyzer: extrapolate the last hour's run cost to a 24-hour total and compare it with the 7-day daily average.
  • If projected daily cost is `3x+` historical average, create a `predictive_cost_projection` insight.
  • Recommendation: set daily cap at `3x` average, per-run cap at `5x` average, and enable surge protection.
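
The projection arithmetic can be sketched as follows. The function and field names are assumptions; the `24` hour extrapolation, the `3x` threshold, and the `3x` daily cap recommendation come from the bullets above.

```typescript
type Projection = {
  projectedDailyUsd: number;
  ratio: number; // projected daily cost divided by the 7-day daily average
  recommendedDailyCapUsd: number;
  insight: "predictive_cost_projection" | null;
};

function projectDailyCost(lastHourUsd: number, last7DaysUsd: number[], threshold = 3): Projection {
  const total = last7DaysUsd.reduce((a, b) => a + b, 0);
  const avg = last7DaysUsd.length > 0 ? total / last7DaysUsd.length : 0;
  // Naive extrapolation: assume the last hour repeats for a full day.
  const projected = lastHourUsd * 24;
  const ratio = avg > 0 ? projected / avg : 0;
  return {
    projectedDailyUsd: projected,
    ratio: ratio,
    recommendedDailyCapUsd: avg * 3,
    insight: ratio >= threshold ? "predictive_cost_projection" : null,
  };
}
```

For instance, a `$5` last hour against a `$10` daily average projects to `$120`, a ratio of 12, so the insight is created and the suggested daily cap is `$30`.
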
Single-run anomaly

If one run costs at least 3x the normal amount, Blackbox identifies the expensive step

  • Example from our analyzer: any run from the last 2 hours costing at least `3x` the historical per-run average is flagged.
  • Outcome: title like `Run abc123… cost $50.00 (10x average)` plus the most expensive step.
  • Why it matters: catches prompt regressions, model upgrades, tool output explosions, and loops quickly.
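
A sketch of the per-run check, under the assumption that each run carries a list of costed steps. Type and function names are hypothetical, and the 2-hour window filter is left to the caller.

```typescript
type Step = { name: string; costUsd: number };
type Run = { id: string; steps: Step[] };
type Anomaly = { multiple: number; topStep: Step };

function findAnomaly(run: Run, avgRunCostUsd: number, threshold = 3): Anomaly | null {
  if (avgRunCostUsd <= 0 || run.steps.length === 0) return null;
  const cost = run.steps.reduce((sum, s) => sum + s.costUsd, 0);
  if (cost < avgRunCostUsd * threshold) return null;
  // The most expensive step is the first place to look for the cause.
  const topStep = run.steps.reduce((max, s) => (s.costUsd > max.costUsd ? s : max));
  return { multiple: cost / avgRunCostUsd, topStep: topStep };
}
```

A run with steps costing `$5`, `$40`, and `$5` against a `$5` average is flagged at 10x, with the `$40` step reported as the most expensive.
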
Loop / retry storm

If the same tool repeats without progress, Blackbox classifies it as a loop

  • Example from tests: six repeated `web_scraper` calls inside the loop window are detected.
  • Outcome: `loop_detected` becomes a critical operational failure pattern.
  • Why it matters: this is one of the fastest ways agents waste tokens, scrape credits, and API quota.
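
A simplified sketch of the repeat count, assuming calls arrive in time order. The window length is a caller-supplied assumption, the threshold of six comes from the example, and the "without progress" condition is not modeled here.

```typescript
type ToolCall = { tool: string; atMs: number };

// Counts how many calls inside the window used the same tool as the most
// recent call, and flags a loop once the count reaches the threshold.
function detectLoop(calls: ToolCall[], windowMs: number, threshold = 6): "loop_detected" | null {
  if (calls.length === 0) return null;
  const last = calls[calls.length - 1];
  const repeats = calls.filter(
    (c) => c.tool === last.tool && last.atMs - c.atMs <= windowMs
  ).length;
  return repeats >= threshold ? "loop_detected" : null;
}
```

A production detector would also compare arguments or results to decide whether the agent is making progress between calls.
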
Schema and data quality

If a tool call is wrong or returns empty data, Blackbox separates failure from degradation

  • Example from tests: `Invalid input field 'startUrls'` becomes `schema_mismatch` and is not retryable.
  • Example from our alert map: if a tool succeeds but returns `[]`, classify as `empty_tool_result` with outcome `degraded`.
  • Why it matters: teams know whether to fix code, credentials, schema, data quality, or fallback logic.
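
The two cases above can be sketched as one classifier over a tool call's error and result. The function name, the return shape, and the regex are illustrative assumptions; other error types are simply left unclassified in this sketch.

```typescript
type Classification =
  | { category: "schema_mismatch"; retryable: false }
  | { category: "empty_tool_result"; outcome: "degraded" };

function classifyToolResult(error: string | null, result: unknown): Classification | null {
  if (error !== null) {
    // Wrong parameters will fail again, so retrying is pointless.
    if (/invalid input field/i.test(error)) {
      return { category: "schema_mismatch", retryable: false };
    }
    return null; // other errors are out of scope for this sketch
  }
  // The call succeeded but produced nothing usable: degraded, not failed.
  if (Array.isArray(result) && result.length === 0) {
    return { category: "empty_tool_result", outcome: "degraded" };
  }
  return null;
}
```
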
LinkedIn scrape controls

If a scraper approaches a cap, Blackbox warns before it becomes a runaway job

  • Example from our alert map: if an agent reaches `7 of 10` configured LinkedIn profiles, emit `cap_near_limit`.
  • If no `maxProfiles` cap exists, emit `missing_guardrail_config`.
  • Why it matters: users see the governance gap before the scraper burns through profiles, credits, or quota.
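
A minimal sketch of the cap check. The function name and the 70 percent warning threshold are assumptions chosen so that the `7 of 10` example triggers; the document does not state the actual threshold.

```typescript
type CapAlert = "cap_near_limit" | "missing_guardrail_config" | null;

function checkProfileCap(
  scraped: number,
  maxProfiles: number | undefined,
  warnPercent = 70 // assumed warning threshold
): CapAlert {
  // No cap configured at all is its own governance finding.
  if (maxProfiles === undefined) return "missing_guardrail_config";
  // Integer comparison avoids floating-point ratios.
  if (scraped * 100 >= maxProfiles * warnPercent) return "cap_near_limit";
  return null;
}
```
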
Security and permission

If a credential or permission fails, Blackbox classifies the operational risk

  • Examples from tests: `401 Unauthorized` becomes `auth_failure`; `403 Forbidden` becomes permission/auth failure.
  • Gateway guardrails detect prompt injection, jailbreak, system-prompt exfiltration, and secret harvesting patterns.
  • Why it matters: the alert explains whether the problem is auth, scope, prompt attack, or runtime failure.