Observability tells you what happened. Blackbox helps answer: should this agent be allowed to do this right now, can we afford it, is this tool approved, are we actually seeing the full run, and what should be blocked next time?
Read left to right. Each node is a real stage in the Blackbox harness.
Create the agent profile: owner, workflow, department, model provider, approved tools, runtime source, and Blackbox key.
Use the SDK wrapper, LLM gateway, Cloud Run/log source, push drain, OTLP ingest, MCP proxy, or active probe endpoint.
The runtime fetches current spend caps, model allowlists, tool policies, blocked tools, connector budgets, and parameter caps.
Before LLMs or tools run, Blackbox checks budgets, rates, model choice, blocked tools, schema rules, connector limits, and duration caps.
Model calls, tool calls, MCP, scripts, browser actions, DB/storage, notifications, infra cost, reasoning, and errors become ordered steps.
Events stream into ingest APIs and become runs, steps, costs, tool costs, gateway requests, logs, alerts, audit rows, and trip logs.
Heartbeats report hooks, wrapped boundaries, strict mode, unwrapped calls, SDK version, policy version, and coverage score.
Cloud Run verification, log-reader checks, static coverage scans, active HTTP probes, and known-failure scans catch blind spots.
The error classifier, registry, loop detection, empty-result alerts, cost attribution, gateway guardrails, and FinOps views help identify the root cause.
Operators review traces, costs, guardrails, registry, alerts, logs, probes, policies, and integrations, then adjust controls.
Reconstructs the agent run from ordered steps instead of trying to infer behavior from generic server logs.
Blocks over-budget, unauthorized, mis-scoped, or unknown tool/model calls before execution.
Shows which boundaries are wrapped, which hooks are installed, and where instrumentation gaps remain.
Turns raw failures into known patterns with severity, category, provider, cost impact, and suggested fix.
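As a rough illustration of turning a raw failure into a known pattern, the sketch below matches an error message against a small registry. The registry entries, field names, and the `classify` function are assumptions made for this example; they are not the Blackbox classifier or its taxonomy.

```typescript
// Hypothetical registry entry; field names are illustrative assumptions.
interface KnownPattern {
  id: string;
  match: RegExp;
  category: string;
  severity: "low" | "medium" | "critical";
  suggestedFix: string;
}

// A tiny example registry. Real registries would be far larger and data-driven.
const REGISTRY: KnownPattern[] = [
  {
    id: "rate-limit",
    match: /429|rate limit/i,
    category: "provider",
    severity: "medium",
    suggestedFix: "Add backoff or lower the per-minute call cap.",
  },
  {
    id: "auth",
    match: /401|invalid api key|unauthorized/i,
    category: "auth",
    severity: "critical",
    suggestedFix: "Rotate or re-provision the provider credential.",
  },
  {
    id: "timeout",
    match: /timed? ?out|ETIMEDOUT/i,
    category: "network",
    severity: "low",
    suggestedFix: "Raise the duration cap or retry with a smaller payload.",
  },
];

// Returns the first registry entry whose pattern matches, or null if the
// failure is not yet a known pattern.
function classify(rawError: string): KnownPattern | null {
  for (const pattern of REGISTRY) {
    if (pattern.match.test(rawError)) {
      return pattern;
    }
  }
  return null;
}
```

First-match-wins keeps the sketch simple; a fuller classifier would likely also attach provider and cost impact, as the card above describes.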
These are the concrete code paths behind the harness.
Initializes run identity, local guardrails, data policy, error registry, schema validation, model switching, fetch wrapping, and live policy refresh.
Runs model allowlist, rate limit, budget, tool, and auto-switch checks before the LLM or tool call happens.
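A minimal sketch of what such a pre-flight check could look like, assuming a locally cached policy and a running tally of spend and call rate. The `Policy`, `CallRequest`, `RunState`, and `preflight` names are hypothetical, and auto-switching to a cheaper model is left out to keep the example short.

```typescript
// All names below are illustrative assumptions, not the Blackbox SDK API.
interface Policy {
  allowedModels: string[];
  blockedTools: string[];
  runBudgetUsd: number;
  maxCallsPerMinute: number;
}

interface CallRequest {
  model?: string; // set for LLM calls
  tool?: string; // set for tool calls
  estimatedCostUsd: number;
}

interface RunState {
  spentUsd: number;
  callsInLastMinute: number;
}

interface Decision {
  allowed: boolean;
  reason: string;
}

// Evaluates every check before any work executes and returns the first
// violation found, so a blocked call never reaches the provider or tool.
function preflight(policy: Policy, req: CallRequest, state: RunState): Decision {
  if (req.model !== undefined && policy.allowedModels.indexOf(req.model) === -1) {
    return { allowed: false, reason: "model " + req.model + " is not on the allowlist" };
  }
  if (req.tool !== undefined && policy.blockedTools.indexOf(req.tool) !== -1) {
    return { allowed: false, reason: "tool " + req.tool + " is blocked" };
  }
  if (state.callsInLastMinute >= policy.maxCallsPerMinute) {
    return { allowed: false, reason: "rate limit reached" };
  }
  if (state.spentUsd + req.estimatedCostUsd > policy.runBudgetUsd) {
    return { allowed: false, reason: "run budget would be exceeded" };
  }
  return { allowed: true, reason: "ok" };
}
```

Returning a reason with every denial makes it straightforward to record why a call was blocked.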
Buffers events, flushes to `/api/ingest/stream`, retries failures, and never includes provider API keys.
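The buffering behavior can be sketched as follows. This is an assumption-heavy illustration: the `EventBuffer` class, the `Transport` callback, and the key-redaction pattern are invented for the example, and the transport is synchronous only to keep it self-contained. A real transport would POST the batch to `/api/ingest/stream` asynchronously.

```typescript
type IngestEvent = Record<string, unknown>;

// Returns true when the batch was accepted. Hypothetical stand-in for an
// HTTP POST to /api/ingest/stream.
type Transport = (batch: IngestEvent[]) => boolean;

// Field names that look like credentials are dropped before buffering.
// The pattern is an assumption for this sketch.
const SECRET_FIELD = /api[_-]?key|authorization|secret/i;

function redact(event: IngestEvent): IngestEvent {
  const clean: IngestEvent = {};
  for (const key of Object.keys(event)) {
    if (!SECRET_FIELD.test(key)) {
      clean[key] = event[key];
    }
  }
  return clean;
}

class EventBuffer {
  private pending: IngestEvent[] = [];
  private send: Transport;
  private maxAttempts: number;

  constructor(send: Transport, maxAttempts: number = 3) {
    this.send = send;
    this.maxAttempts = maxAttempts;
  }

  // Redacts on the way in, so secrets never sit in the buffer.
  push(event: IngestEvent): void {
    this.pending.push(redact(event));
  }

  size(): number {
    return this.pending.length;
  }

  // Tries to deliver the whole buffer, retrying up to maxAttempts times.
  // Returns the number of events delivered; on failure the events stay
  // buffered for the next flush.
  flush(): number {
    if (this.pending.length === 0) {
      return 0;
    }
    const batch = this.pending.slice();
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      if (this.send(batch)) {
        this.pending = [];
        return batch.length;
      }
    }
    return 0;
  }
}
```

Redacting at `push` time rather than at `flush` time means a crash dump of the buffer would not contain provider keys either.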
Detects raw vendor calls, raw LLM clients, and missing SDK initialization so teams can fix blind spots before production.
Checks an agent manifest against known failure history and can block critical repeated failures before deployment.
Parses a Cloud Run URL, probes Cloud Logging with Blackbox credentials, and returns a copyable IAM command if permission is missing.
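The parsing and IAM steps might look roughly like the sketch below. It assumes the deterministic Cloud Run URL form `https://SERVICE-PROJECT_NUMBER.REGION.run.app` (optionally prefixed with a `TAG---` revision tag); older hashed `a.run.app` URLs do not carry a project number and are rejected here. The function names and the service account email passed by the caller are hypothetical, and the log probe itself is not shown.

```typescript
interface CloudRunService {
  service: string;
  projectNumber: string;
  region: string;
}

// Extracts service name, project number, and region from a deterministic
// Cloud Run URL. Returns null for any URL that does not fit that shape.
function parseCloudRunUrl(url: string): CloudRunService | null {
  const m = /^https:\/\/([a-z0-9-]+)-(\d+)\.([a-z0-9-]+)\.run\.app(\/|$)/.exec(url.trim());
  if (m === null) {
    return null;
  }
  // Drop an optional "TAG---" traffic-tag prefix from the service name.
  const parts = m[1].split("---");
  return {
    service: parts[parts.length - 1],
    projectNumber: m[2],
    region: m[3],
  };
}

// Builds a copyable gcloud command granting read-only log access.
// serviceAccountEmail is whatever identity the reader uses; it is an
// input to this sketch, not a known Blackbox value.
function loggingViewerCommand(projectNumber: string, serviceAccountEmail: string): string {
  if (!/^\d+$/.test(projectNumber)) {
    throw new Error("project number must be numeric");
  }
  if (!/^[a-z0-9.@-]+$/i.test(serviceAccountEmail)) {
    throw new Error("unexpected characters in service account email");
  }
  return [
    "gcloud projects add-iam-policy-binding " + projectNumber,
    '  --member="serviceAccount:' + serviceAccountEmail + '"',
    '  --role="roles/logging.viewer"',
  ].join(" \\\n");
}
```

Validating both inputs before building the command keeps the copyable string free of anything the customer admin did not intend to run.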
This answers the practical technical question: is it just the SDK, a URL, or cloud access?
Customer installs `@blackbox/sdk`, adds `BLACKBOX_API_KEY` and `BLACKBOX_ENDPOINT`, wraps their LLM client, and wraps external tools with `session.tool()`.
Customer pastes their Cloud Run service URL. We parse service name, project number, and region, then verify whether Blackbox can read logs.
For Vercel, Cloudflare, AWS, GCP sinks, OTLP, or custom runtimes, the customer streams logs/events to a Blackbox ingest URL with an agent key.
For a public HTTP agent endpoint, Blackbox can run active probes against approved sample payloads and record auth, schema, latency, and failure behavior.
Blackbox is designed to prove coverage and surface risk before, during, and after deployment.
Scan source for raw vendor endpoints, raw LLM clients, and missing `BlackboxSession` initialization. Output is a findings report with remediation.
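To make the scan concrete, here is a simplified sketch that checks one source string line by line. The rule names, the two vendor hostnames, and the `scanSource` function are assumptions chosen for illustration; a production scanner would likely parse the code rather than rely on regular expressions.

```typescript
interface Finding {
  line: number; // 1-based line number, or 0 for file-level findings
  rule: string;
  text: string;
}

// Example rules only. The hostnames and constructor names are common ones,
// not an exhaustive or official list.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "raw-vendor-endpoint", pattern: /https:\/\/api\.(openai|anthropic)\.com/ },
  { rule: "raw-llm-client", pattern: /new\s+(OpenAI|Anthropic)\s*\(/ },
];

// Scans one source file and returns a findings report.
function scanSource(source: string): Finding[] {
  const findings: Finding[] = [];
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    for (const r of RULES) {
      if (r.pattern.test(lines[i])) {
        findings.push({ line: i + 1, rule: r.rule, text: lines[i].trim() });
      }
    }
  }
  // File-level check: no reference to BlackboxSession anywhere in the file.
  if (source.indexOf("BlackboxSession") === -1) {
    findings.push({ line: 0, rule: "missing-session-init", text: "no BlackboxSession reference found" });
  }
  return findings;
}
```

Each finding carries the line number and the offending text so a remediation note can point at the exact call site.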
For Cloud Run, paste the service URL, parse the service metadata, probe the logs, and ask the customer admin for `roles/logging.viewer` only if needed.
SDK reports strict mode, hooks installed, wrapped boundaries, unwrapped calls, coverage score, SDK version, and live policy version.
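A heartbeat payload along these lines could be assembled as sketched below. The field names mirror the list above, but the payload shape and the scoring formula (wrapped boundaries as a percentage of all observed boundaries) are assumptions; the source does not state how the coverage score is computed.

```typescript
// Hypothetical heartbeat shape; not the actual wire format.
interface Heartbeat {
  strictMode: boolean;
  hooksInstalled: string[];
  wrappedBoundaries: number;
  unwrappedCalls: number;
  coverageScore: number; // 0 to 100
  sdkVersion: string;
  policyVersion: string;
}

// Assumed formula: share of observed boundaries that are wrapped, as a
// whole-number percentage. With nothing observed yet, report 0 rather
// than claiming full coverage.
function coverageScore(wrapped: number, unwrapped: number): number {
  const total = wrapped + unwrapped;
  if (total === 0) {
    return 0;
  }
  return Math.round((wrapped / total) * 100);
}

function buildHeartbeat(
  strictMode: boolean,
  hooksInstalled: string[],
  wrappedBoundaries: number,
  unwrappedCalls: number,
  sdkVersion: string,
  policyVersion: string
): Heartbeat {
  return {
    strictMode: strictMode,
    hooksInstalled: hooksInstalled,
    wrappedBoundaries: wrappedBoundaries,
    unwrappedCalls: unwrappedCalls,
    coverageScore: coverageScore(wrappedBoundaries, unwrappedCalls),
    sdkVersion: sdkVersion,
    policyVersion: policyVersion,
  };
}
```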
Blocked tools, model allowlists, per-call limits, per-run budgets, connector budgets, rate limits, and duration caps fire before work executes.
Raw errors are classified and matched against recurring failures so new agents can inherit lessons from previous production incidents.
Every run produces a timeline of model calls, tools, scripts, cost, latency, policy state, error class, and source coverage.
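One way to picture the timeline is as events grouped by run and sorted by a step sequence number, as in this sketch. The `StepEvent` shape and the `buildTimelines` function are illustrative assumptions and do not reflect the actual Blackbox schema.

```typescript
// Hypothetical event shape; field names are assumptions.
interface StepEvent {
  runId: string;
  seq: number; // position of the step within its run
  kind: "model" | "tool" | "script" | "error";
  name: string;
  costUsd: number;
  latencyMs: number;
}

interface RunTimeline {
  runId: string;
  steps: StepEvent[];
  totalCostUsd: number;
  totalLatencyMs: number;
}

// Groups events by run, orders each run's steps by sequence number, and
// totals cost and latency. Events may arrive out of order.
function buildTimelines(events: StepEvent[]): Record<string, RunTimeline> {
  const runs: Record<string, RunTimeline> = {};
  for (const ev of events) {
    if (!runs[ev.runId]) {
      runs[ev.runId] = { runId: ev.runId, steps: [], totalCostUsd: 0, totalLatencyMs: 0 };
    }
    const run = runs[ev.runId];
    run.steps.push(ev);
    run.totalCostUsd += ev.costUsd;
    run.totalLatencyMs += ev.latencyMs;
  }
  for (const runId of Object.keys(runs)) {
    runs[runId].steps.sort((a, b) => a.seq - b.seq);
  }
  return runs;
}
```

Sorting by an explicit sequence number rather than arrival time is what lets the run be reconstructed even when events are streamed or retried out of order.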
Most AI observability tools help teams inspect model calls after they happen. The current market covers important pieces of AI operations, but each category stops short of controlling agent execution while it is happening.
Blackbox is built for the next problem: agents are no longer just generating text, they are doing work. They call tools, scrape websites, query databases, trigger APIs, run scripts, and retry. That creates hidden spend and hidden risk.
These are not generic promises. They map to guardrails, classifiers, an alert taxonomy, and predictive cost analyzers that we have already built and tested.