How to Prove What Happened When Your AI Provider Goes Dark

1. The Evidence Gap Problem
2. Anatomy of a Provider Outage
3. SWT3 Procedure Mapping
4. Before: Continuous Witness Chain
5. During: Gap Detection and Incident Evidence
6. After: Recovery Verification
7. Provider Inheritance and Shared Responsibility
8. Multi-Agent and Agentic Failures
9. Regulatory Requirements
10. Implementation Guide
11. Clearing Levels During Incidents

Audience: AI compliance officers, platform engineers, risk managers, and auditors responsible for maintaining evidence continuity during AI service disruptions. Applicable to any organization using third-party AI inference providers (OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, etc.).

1. The Evidence Gap Problem

When a major AI provider experiences a global outage, every downstream customer faces the same problem: total evidence blackout. There is no independent record of what was in-flight, what decisions were being processed, what data was exposed to the model, or how long the disruption lasted.

For organizations operating under compliance frameworks, this creates specific failures:

No proof of gap duration. Without independent timestamps, you cannot demonstrate to an auditor exactly when AI processing stopped and resumed.
No evidence of in-flight state. Requests that were submitted but never completed leave no trace. In agentic systems, partially executed tool chains vanish.
No drift baseline comparison. Post-recovery, you cannot verify that model behavior returned to its pre-outage baseline without an independent reference point.
No incident documentation. Regulatory frameworks require documented incident response. Without witnessing infrastructure, the incident report relies entirely on the provider's disclosure timeline.
Inherited compliance claims collapse. If your security posture inherits availability, integrity, or monitoring claims from the AI provider, an outage invalidates those inheritance assumptions until independently verified.

The core issue: Provider-side logging and monitoring fail at exactly the moment you need them most. When the provider is down, their telemetry is down with them. An independent witness layer is the only architecture that survives provider failure.

2. Anatomy of a Provider Outage

AI provider outages follow a predictable lifecycle with distinct evidence requirements at each phase:

Phase 1: Pre-Failure (Normal Operations)

Continuous witnessing establishes the baseline

Every inference call generates a witness anchor with latency, token counts, guardrail results, and model identity. This creates a verifiable record of normal operating parameters. When an outage occurs, the last successful anchor timestamp marks the last moment at which the service was demonstrably available.

Phase 2: Failure Onset (Minutes 0-5)

Timeouts, retries, and cascading failures

API calls begin failing. In agentic architectures, autonomous agents may retry aggressively, compounding the failure. Without circuit breakers or witness-aware retry logic, agents can generate thousands of failed requests, masking the root cause and creating resource contention across the provider's infrastructure. SWT3 witness anchors record each failed attempt with its error class, enabling post-incident reconstruction of the failure cascade.

Phase 3: Full Outage (Minutes 5-N)

Complete service unavailability

No inference calls succeed. The witness chain has a verifiable gap. The absence of successful anchors is itself evidence: the interval between the last successful anchor and the first post-recovery anchor bounds the outage window, to within the spacing of your witnessed calls.

Phase 4: Recovery (Gradual)

Service restoration and baseline comparison

The provider reports recovery, but reported recovery and actual recovery can diverge. SWT3 drift detection (AI-DRIFT.1) compares post-recovery inference behavior against the pre-outage baseline. Latency changes, output distribution shifts, or guardrail behavior differences are detected automatically.

Phase 5: Post-Incident (Hours to Days)

Root cause analysis and regulatory reporting

The provider publishes a post-mortem. Your compliance team files incident documentation. SWT3 provides the independent evidence chain: exact timestamps, affected procedures, behavioral drift, and recovery verification anchors. This evidence is yours, not the provider's.

3. SWT3 Procedure Mapping

The following SWT3 procedures directly address evidence requirements during AI provider outages. Each procedure is mapped to its outage-phase relevance and the compliance gap it closes.

SWT3 Procedure	Name	Outage Phase	Evidence Produced	Coverage
AI-CHAIN.1	Chain of Custody	All phases	Ordered witness chain with gap detection	Full
AI-CHAIN.2	Chain Trust Level	Recovery	Trust level enforcement across handoffs	Full
AI-INCIDENT.1	Incident Witnessing	Onset + Post	Timestamped incident record with classification	Full
AI-DRIFT.1	Drift Detection	Recovery	Pre/post baseline comparison	Full
AI-ROBUST.1	Robustness Testing	Recovery	Stress test results post-restoration	Full
AI-PERF.1	Performance Monitoring	All phases	Latency, throughput, error rates	Full
AI-INF.1	Inference Witnessing	Pre + Recovery	Per-call evidence with model ID and hashes	Full
AI-INF.2	Latency Monitoring	Pre + Recovery	Response time baseline and deviation	Full
AI-REV.1	Anchor Revocation	Post-incident	Revocation of compromised anchors	Full
AI-SUPPLY.1	Supply Chain	Post-incident	Provider dependency documentation	Full
AI-MULTI.1	Multi-Agent Coordination	Onset	Agent interaction chain during failure	Full
AI-SAFE.1	Safety Constraints	Recovery	Guardrail verification post-restoration	Full
AI-AUDIT.1	Audit Trail	All phases	Immutable compliance event log	Full
AI-PMM.1	Post-Market Monitoring	Post-incident	Ongoing surveillance after recovery	Full
AI-CYBER.1	Cybersecurity Measures	Onset + During	Security posture during degraded state	Partial
AI-TRANS.1	Transparency Record	Post-incident	Disclosure of outage impact to stakeholders	Partial

Coverage key: Full = SWT3 procedure directly produces outage-relevant evidence with no additional tooling. Partial = SWT3 provides supporting evidence; organizational process completes the requirement.

4. Before: Continuous Witness Chain

The value of SWT3 during an outage depends entirely on what was recorded before the outage. Every inference call wrapped with SWT3 produces a witness anchor containing:

Cryptographic hash of the prompt and response (content never leaves your infrastructure)
Model identifier and provider
Latency in milliseconds
Input and output token counts
Guardrail pass/fail count and guardrail names
Millisecond-precision timestamp in the fingerprint formula
Chain linkage via cycle_id for multi-step workflows

This creates a dense, continuous chain of evidence. The chain's density determines the precision of gap detection: if you witness every call, the uncertainty at each edge of the gap is bounded by the interval between calls. If you witness hourly, the measured gap could overstate the real outage by up to 60 minutes at each edge.
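The density argument can be made concrete with a little arithmetic. The sketch below is illustrative only: `outage_bounds` is a hypothetical helper, not part of SWT3, and it assumes anchors are minted at a fixed interval, so the service may have failed at any point after the last good anchor and recovered at any point before the first good one.

```python
from datetime import datetime, timedelta, timezone


def outage_bounds(last_ok, first_ok_after, witness_interval):
    """Return (shortest, longest) true outage consistent with the observed gap.

    Assumption: anchors arrive at a fixed interval, so up to one interval of
    slack can accrue at each edge of the observed gap.
    """
    observed = first_ok_after - last_ok
    shortest = max(observed - 2 * witness_interval, timedelta(0))
    return shortest, observed


last_ok = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
first_ok_after = datetime(2025, 1, 1, 15, 0, tzinfo=timezone.utc)

# Hourly witnessing: the real outage lasted somewhere between 1 and 3 hours.
hourly = outage_bounds(last_ok, first_ok_after, timedelta(hours=1))

# Witnessing a call every 2 seconds: the slack shrinks to 4 seconds in total.
per_call = outage_bounds(last_ok, first_ok_after, timedelta(seconds=2))
```

The observed gap is always an upper bound on the outage; denser witnessing only tightens the lower bound toward it.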

// TypeScript: Every inference call is witnessed automatically
import OpenAI from 'openai';
import { createWitness } from '@tenova/swt3-ai';

// Placeholder: substitute your application's own session identifier
const sessionId = crypto.randomUUID();

const witness = createWitness({
  endpoint: 'https://sovereign.tenova.io/api/v1/witness',
  apiKey: 'axm_live_...',
  tenantId: 'your-tenant-id',
  clearingLevel: 1,
  agentId: 'billing-agent-prod',
  cycleId: `session-${sessionId}`,  // Links all calls in this workflow
});

// wrap() intercepts every call and mints an anchor
const client = witness.wrap(new OpenAI());

5. During: Gap Detection and Incident Evidence

When the provider goes down, successful anchors stop arriving. This absence is the evidence.

Chain gap detection (AI-CHAIN.1)

The cycle_id field links sequential calls within a workflow. When a chain has anchors at timestamps T1, T2, T3, then nothing until T7, the gap between T3 and T7 is a provable outage window. No manual logging required. The gap proves itself.
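The gap logic described above can be sketched in a few lines. This is a simplified illustration, not the SWT3 implementation: `find_gaps` and the `max_interval` threshold are hypothetical, and the timestamps stand in for the T1, T2, T3, T7 anchors of a single `cycle_id`.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_interval):
    """Return (start, end) pairs where consecutive anchors are too far apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


def at(minute, second):
    return datetime(2025, 1, 1, 10, minute, second, tzinfo=timezone.utc)


# T1, T2, T3 arrive two seconds apart, then nothing until T7 five minutes later.
chain = [at(0, 0), at(0, 2), at(0, 4), at(5, 0)]
gaps = find_gaps(chain, max_interval=timedelta(seconds=30))
```

Choosing `max_interval` is a judgment call: it should sit comfortably above the normal spacing between calls in the workflow, otherwise ordinary idle periods will be reported as gaps.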

Failed-call witnessing

SWT3 witnesses both successful and failed inference calls. During the onset phase, the SDK records timeout errors, HTTP 5xx responses, and connection failures as anchors with factor values indicating failure. This creates evidence of:

Exactly when failures began (first failed anchor timestamp)
Retry behavior and frequency (anchor count during onset phase)
Error classification (timeout vs. server error vs. connection refused)
Which agents or workflows were affected (filtered by agent_id)
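As a rough sketch of how those four facts could be derived from failed-call anchors, the example below aggregates a list of records. The `FailedAnchor` shape and the error-class strings are assumptions for illustration, not the SDK's schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedAnchor:
    timestamp_ms: int
    agent_id: str
    error_class: str  # hypothetical labels: "timeout", "http_5xx", "connection_refused"


def summarize_onset(failures):
    """Summarize when failures began, how often, of what kind, and for whom."""
    if not failures:
        return None
    return {
        "first_failure_ms": min(f.timestamp_ms for f in failures),
        "attempts": len(failures),
        "by_error_class": dict(Counter(f.error_class for f in failures)),
        "by_agent": dict(Counter(f.agent_id for f in failures)),
    }


onset = summarize_onset([
    FailedAnchor(1_000, "billing-agent-prod", "timeout"),
    FailedAnchor(1_250, "billing-agent-prod", "timeout"),
    FailedAnchor(1_400, "support-agent-prod", "http_5xx"),
])
```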

Sentinel daemon (continuous monitoring)

The SWT3 Sentinel daemon runs as an independent process, monitoring the witness chain for anomalies. During an outage, the Sentinel detects chain gaps in real-time and can trigger alerts or failover procedures. Because the Sentinel operates independently of the AI provider, it continues functioning when the provider is down.
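The idea of a monitor that keeps working while the provider is down can be illustrated with a small watchdog. This is a conceptual sketch only: `ChainWatchdog` and its methods are hypothetical and say nothing about how the Sentinel daemon is actually built or configured.

```python
class ChainWatchdog:
    """Track the latest anchor per workflow and report chains that went silent."""

    def __init__(self, max_silence_ms):
        self.max_silence_ms = max_silence_ms
        self._last_seen = {}  # cycle_id -> most recent anchor timestamp (ms)

    def observe(self, cycle_id, timestamp_ms):
        previous = self._last_seen.get(cycle_id, timestamp_ms)
        self._last_seen[cycle_id] = max(previous, timestamp_ms)

    def stalled(self, now_ms):
        """Return the cycle_ids with no anchor inside the allowed silence window."""
        return sorted(
            cycle_id
            for cycle_id, seen in self._last_seen.items()
            if now_ms - seen > self.max_silence_ms
        )


watchdog = ChainWatchdog(max_silence_ms=30_000)
watchdog.observe("session-a", 1_000)
watchdog.observe("session-b", 50_000)
silent = watchdog.stalled(now_ms=60_000)
```

The watchdog depends only on local state and a clock, which is the property that matters here: nothing in the check requires the AI provider to be reachable.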

6. After: Recovery Verification

The provider says they are back. How do you verify?

Drift detection (AI-DRIFT.1)

SWT3 drift detection compares post-recovery inference behavior against the pre-outage baseline. Key metrics:

Latency distribution: Has the P50/P95/P99 shifted? Degraded performance after recovery is common and may indicate the provider restored service on reduced capacity.
Output distribution: Are response patterns consistent? A provider recovering from an outage may fail over to a different model version, checkpoint, or data center.
Guardrail behavior: Do the same inputs trigger the same guardrails? Changes here indicate potential model substitution or configuration drift during recovery.
Token counts: Unusual output length changes can signal model version differences.
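The latency comparison can be sketched with the standard library alone. The helper names and the 20% tolerance below are illustrative assumptions, not SWT3 defaults; a real check would also cover output, guardrail, and token-count distributions.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """Return P50/P95/P99 for a window of latencies (needs at least two samples)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def latency_drift(baseline_ms, recovery_ms, tolerance=0.20):
    """Return the percentiles whose relative change exceeds `tolerance`.

    Assumes baseline latencies are positive, so the ratio is well defined.
    """
    before = latency_percentiles(baseline_ms)
    after = latency_percentiles(recovery_ms)
    changes = {name: (after[name] - before[name]) / before[name] for name in before}
    return {name: change for name, change in changes.items() if abs(change) > tolerance}


baseline = list(range(100, 200))           # pre-outage window, in milliseconds
recovery = [ms * 1.5 for ms in baseline]   # post-recovery window, uniformly 50% slower
drift = latency_drift(baseline, recovery)
```

An empty result means every tracked percentile stayed within tolerance; any entry that remains is a candidate for follow-up before the recovery is treated as verified.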

# Python: Post-recovery drift check
from swt3_ai import SWT3Witness

witness = SWT3Witness(
    endpoint='https://sovereign.tenova.io/api/v1/witness',
    api_key='axm_live_...',
    tenant_id='your-tenant-id',
)

# The witness chain automatically establishes
# pre-outage baseline vs post-recovery behavior.
# Query the drift API to compare windows:
# GET /api/v1/ai-witness?model_id=gpt-4&drift=true

Anchor revocation (AI-REV.1)

If post-incident analysis reveals that in-flight requests during the outage onset produced corrupted or unreliable results, those anchors can be revoked:

// Revoke anchors from the failure window
witness.revoke('a1b2c3d4e5f6', 'error_correction');
// Reason codes: model_recall, policy_violation,
// data_contamination, consent_withdrawal,
// regulatory_order, error_correction, unspecified

Revocation mints an AI-REV.1 anchor that references the original fingerprint. The original anchor remains in the ledger (immutability is preserved), but verification queries return the revocation status. This is critical for audit trails: you can prove that unreliable results were identified and formally invalidated.
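The append-only behavior described above can be modeled in a few lines. This is a toy in-memory model under stated assumptions: `AppendOnlyLedger`, its method names, and the entry fields are hypothetical; only the reason codes and the AI-REV.1 procedure name come from the text.

```python
REASON_CODES = {
    "model_recall", "policy_violation", "data_contamination",
    "consent_withdrawal", "regulatory_order", "error_correction", "unspecified",
}


class AppendOnlyLedger:
    """Entries are only ever appended; revocation is itself a new entry."""

    def __init__(self):
        self._entries = []

    def mint(self, fingerprint, procedure, references=None, reason=None):
        entry = {"fingerprint": fingerprint, "procedure": procedure,
                 "references": references, "reason": reason}
        self._entries.append(entry)
        return entry

    def revoke(self, fingerprint, reason, revocation_fingerprint):
        if reason not in REASON_CODES:
            raise ValueError(f"unknown reason code: {reason}")
        if not any(e["fingerprint"] == fingerprint for e in self._entries):
            raise KeyError(fingerprint)
        return self.mint(revocation_fingerprint, "AI-REV.1",
                         references=fingerprint, reason=reason)

    def status(self, fingerprint):
        if any(e["procedure"] == "AI-REV.1" and e["references"] == fingerprint
               for e in self._entries):
            return "revoked"
        known = any(e["fingerprint"] == fingerprint for e in self._entries)
        return "valid" if known else "unknown"

    def entries(self):
        return tuple(self._entries)


ledger = AppendOnlyLedger()
ledger.mint("a1b2c3d4e5f6", "AI-INF.1")
status_before = ledger.status("a1b2c3d4e5f6")
ledger.revoke("a1b2c3d4e5f6", "error_correction", revocation_fingerprint="f6e5d4c3b2a1")
status_after = ledger.status("a1b2c3d4e5f6")
```

The original entry is never modified or deleted, so an auditor can see both what was recorded and when it was formally invalidated.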

7. Provider Inheritance and Shared Responsibility

Many organizations inherit compliance claims from their AI providers. Common inherited assertions include:

"Our AI provider maintains 99.9% uptime" (availability)
"The provider monitors model performance continuously" (monitoring)
"The provider encrypts data in transit and at rest" (data protection)
"The provider's infrastructure is SOC 2 Type II certified" (security)

An outage invalidates the availability claim immediately. But it also raises questions about the monitoring, security, and integrity claims: if the provider cannot maintain uptime, what else in their attestation chain is weaker than stated?

SA-9(2) (External System Services: Identification of Functions, Ports, Protocols, and Services) / NIST 800-53: Organizations must require providers of external system services "to identify the functions, ports, protocols, and other services required for the use of such services." A provider outage demonstrates why inherited controls require independent verification. SWT3 provides this verification through continuous witnessing that does not depend on the provider's own monitoring infrastructure.

Independent evidence vs. inherited claims

Claim Type	Provider Says	During Outage	SWT3 Evidence
Availability	"99.9% uptime SLA"	Unverifiable	Chain gap timestamps prove exact downtime window
Monitoring	"We monitor 24/7"	Provider monitoring also failed	Sentinel daemon operates independently
Integrity	"Model outputs are consistent"	Unknown until recovery	Drift detection compares pre/post baselines
Incident Response	"We notify within 72 hours"	Waiting for provider disclosure	AI-INCIDENT.1 anchor records your detection time independently
Recovery	"Service restored at T"	No independent verification	First successful post-outage anchor proves actual recovery time

8. Multi-Agent and Agentic Failures

Agentic AI architectures introduce a unique failure mode during provider outages: autonomous retry cascades.

When an AI agent encounters a provider failure, its default behavior is often to retry. In multi-agent systems where agents delegate to other agents, a single provider outage can trigger a cascade where every agent in the system simultaneously retries, compounding load on the failing provider and potentially delaying recovery for all customers.

In the worst case, a sub-agent designed to execute tasks autonomously continues generating requests against a failing endpoint, effectively creating an internal amplification loop. The agent does not distinguish between "the provider is slow" and "the provider is down," so it keeps trying with increasing urgency. At scale, dozens or hundreds of autonomous agents retrying simultaneously can compound the original failure.

SWT3 protections for agentic failure

AI-MULTI.1 (Multi-Agent Coordination): Witnesses inter-agent handoffs. During an outage, the coordination chain shows exactly which agent triggered the retry cascade and how it propagated.
AI-CHAIN.1 + cycle_id: Links all calls within an agentic workflow. Post-incident, you can reconstruct the full cascade: which agent called which, how many retries occurred, and where the circuit should have broken.
ChainEnforcer (strict mode): The SWT3 ChainEnforcer can be configured with a chainMinTrustLevel. When trust level drops below the threshold (e.g., repeated failures lower the effective trust), the enforcer blocks further calls rather than allowing unbounded retries.
Token budget: The tokenBudget configuration limits total tokens consumed per chain. A runaway agent hitting a failing endpoint still consumes tokens for each attempt. When the budget is exhausted, the chain stops, preventing infinite retry loops.
AI-SAFE.1 (Safety Constraints): Guardrail verification ensures that post-recovery agents re-establish safety boundaries before resuming autonomous operation.

Architecture note: Agentic systems without independent circuit breakers will amplify provider failures. The AI provider's rate limiting may be the only backstop, and during an outage, rate limiting infrastructure may also be degraded. SWT3 token budgets and chain enforcement provide a client-side circuit breaker that operates independently of the provider.
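A client-side circuit breaker of the kind described in the note can be sketched as follows. This is a generic illustration, not the ChainEnforcer or tokenBudget implementation: the class, the consecutive-failure threshold, and the choice to charge estimated tokens for failed attempts are all assumptions.

```python
class CircuitOpen(RuntimeError):
    """Raised when the breaker refuses to let another call through."""


class ClientCircuitBreaker:
    def __init__(self, token_budget, max_consecutive_failures):
        self.tokens_left = token_budget
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def check(self, estimated_tokens):
        """Refuse the call before it is sent if either limit is exceeded."""
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise CircuitOpen("too many consecutive failures")
        if estimated_tokens > self.tokens_left:
            raise CircuitOpen("token budget exhausted")

    def record(self, ok, tokens_used):
        self.tokens_left -= tokens_used
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def call_with_breaker(breaker, send, estimated_tokens):
    """Run `send()` only if the breaker allows it, and record the outcome."""
    breaker.check(estimated_tokens)
    try:
        result = send()
    except Exception:
        # Assumption: failed attempts are charged against the budget too.
        breaker.record(ok=False, tokens_used=estimated_tokens)
        raise
    breaker.record(ok=True, tokens_used=estimated_tokens)
    return result
```

Because the check happens before the request is sent, a runaway agent is stopped locally and adds no further load to a provider that is already failing.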

9. Regulatory Requirements

Multiple regulatory frameworks impose specific obligations during AI system disruptions:

EU AI Act

Article	Requirement	SWT3 Procedure
Art. 12(2)	Logging of system operation for traceability	AI-INF.1, AI-AUDIT.1, AI-CHAIN.1
Art. 12(1)	Automatic recording of events during lifecycle	AI-CHAIN.1 (gap = recorded event)
Art. 15(4)	Resilience against errors, faults, inconsistencies	AI-ROBUST.1, AI-DRIFT.1
Art. 73	Serious incident reporting	AI-INCIDENT.1
Art. 72	Post-market monitoring	AI-PMM.1, AI-DRIFT.1

NIST AI RMF

Function	Category	Outage Relevance	SWT3 Procedure
Govern	GV-1.3	Incident response processes	AI-INCIDENT.1
Measure	MS-2.6	Performance monitoring	AI-PERF.1, AI-INF.2
Measure	MS-2.7	Drift and degradation tracking	AI-DRIFT.1
Manage	MG-3.1	Incident escalation	AI-INCIDENT.1, AI-AUDIT.1
Manage	MG-4.1	Post-deployment monitoring	AI-PMM.1

NIST 800-53

Control	Title	Outage Relevance	SWT3 Procedure
CP-2	Contingency Plan	AI failover documentation	AI-SUPPLY.1
IR-4	Incident Handling	Detection, analysis, containment	AI-INCIDENT.1
IR-6	Incident Reporting	Timely notification to authorities	AI-INCIDENT.1, AI-TRANS.1
SA-9	External System Services	Provider inheritance validation	AI-SUPPLY.1, AI-CHAIN.1
SI-4	System Monitoring	Independent monitoring capability	Sentinel daemon, AI-PERF.1
SI-7	Software/Info Integrity	Post-recovery integrity verification	AI-DRIFT.1, AI-SAFE.1

SR 11-7 (Model Risk Management)

For financial institutions, a provider outage affecting model inference triggers MRM obligations: the model risk function must document the disruption, assess whether model outputs during the degradation window are reliable, and validate post-recovery model behavior. SWT3 chain gaps, drift detection, and anchor revocation provide the evidence artifacts required by the MRM framework.

10. Implementation Guide

Minimum configuration for outage-resilient witnessing:

// TypeScript: Outage-resilient witness configuration
import { createWitness } from '@tenova/swt3-ai';

const witness = createWitness({
  endpoint: 'https://sovereign.tenova.io/api/v1/witness',
  apiKey: process.env.SWT3_API_KEY,
  tenantId: process.env.SWT3_TENANT_ID,
  clearingLevel: 1,

  // Identity: know which agent failed
  agentId: 'my-agent-prod',

  // Chain linkage: detect gaps per workflow
  cycleId: `session-${crypto.randomUUID()}`,

  // Circuit breaker: prevent runaway retries
  tokenBudget: 50000,
  chainMinTrustLevel: 2,

  // Signing: tamper-evident chain
  signingKey: process.env.SWT3_SIGNING_KEY,
  signingAlgorithm: 'hmac-sha256',

  // Flush callback: alert on anomalies
  onFlush: (payloads, receipts) => {
    const failures = payloads.filter(p => p.factor_a === 0);
    if (failures.length > 3) {
      alertOps('AI provider degradation detected', failures);  // alertOps: your own alerting function
    }
  },
});

# Python: Equivalent configuration
import os
from uuid import uuid4

from swt3_ai import SWT3Witness

witness = SWT3Witness(
    endpoint='https://sovereign.tenova.io/api/v1/witness',
    api_key=os.environ['SWT3_API_KEY'],
    tenant_id=os.environ['SWT3_TENANT_ID'],
    clearing_level=1,
    agent_id='my-agent-prod',
    cycle_id=f'session-{uuid4()}',
    signing_key=os.environ['SWT3_SIGNING_KEY'],
    signing_algorithm='hmac-sha256',
    # alert_if_degraded is your own handler; it is not defined in this snippet
    on_flush=lambda payloads, receipts: alert_if_degraded(payloads),
)

Post-outage checklist

Query the witness ledger for the last successful anchor before the gap
Query for the first successful anchor after recovery
Calculate the exact outage window (anchor timestamps, not provider's claimed times)
Run drift detection comparing 24-hour windows before and after
Verify guardrail behavior is consistent post-recovery (AI-SAFE.1)
Revoke any anchors from the degradation onset window if outputs are unreliable (AI-REV.1)
Mint an AI-INCIDENT.1 anchor documenting the event
Export evidence bundle for compliance records
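The first three checklist steps can be sketched as a single pass over exported anchors. The input shape, a list of `(timestamp_ms, ok)` pairs, and the `outage_window` helper are assumptions for illustration; the actual ledger query interface is not shown here.

```python
def outage_window(anchors):
    """Locate the outage from anchors given as (timestamp_ms, ok) pairs.

    Returns None when no failed anchors were recorded. The window runs from
    the last success before the first failure to the first success after the
    last failure.
    """
    ordered = sorted(anchors)
    failures = [t for t, ok in ordered if not ok]
    if not failures:
        return None
    first_failure, last_failure = failures[0], failures[-1]
    before = [t for t, ok in ordered if ok and t < first_failure]
    after = [t for t, ok in ordered if ok and t > last_failure]
    start = before[-1] if before else None
    end = after[0] if after else None
    duration = end - start if start is not None and end is not None else None
    return {"last_success_ms": start, "first_recovery_ms": end, "duration_ms": duration}


window = outage_window([
    (1_000, True), (2_000, True),      # normal operation
    (3_000, False), (4_000, False),    # failed calls during onset
    (900_000, True), (901_000, True),  # service restored
])
```

The resulting timestamps come from your own anchors, so they can be set side by side with the provider's published timeline in the incident record.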

11. Clearing Levels During Incidents

During an incident, organizations may need to share more operational detail with regulators or incident response teams than normal operations would permit. SWT3 clearing levels accommodate this:

Level	Name	Normal Use	During Incident
0	Analytics	Internal R&D	Full detail for internal incident team
1	Standard	Production	Regulator-appropriate detail level
2	Sensitive	PII-adjacent	Third-party responders (limited model info)
3	Classified	Sovereign	Cross-agency notification (factors only)

At all clearing levels, the chain gap evidence is preserved. Even at Level 3 (Classified), the timestamp and factor data prove the outage window without exposing model identity, provider, or operational context. This allows classified environments to report incidents up the chain of command without declassifying operational details.

Neutrality statement: TeNova Axiom is an independent evidence platform. It does not grant certifications, assign fault for outages, or replace incident response processes. SWT3 witness anchors record what happened and when. The interpretation of that evidence, the determination of fault, and the regulatory response are the responsibility of the deploying organization and its designated authorities.
