Contents

1. The Evidence Gap Problem
2. Anatomy of a Provider Outage
3. SWT3 Procedure Mapping
4. Before: Continuous Witness Chain
5. During: Gap Detection and Incident Evidence
6. After: Recovery Verification
7. Provider Inheritance and Shared Responsibility
8. Multi-Agent and Agentic Failures
9. Regulatory Requirements
10. Implementation Guide
11. Clearing Levels During Incidents

Audience: AI compliance officers, platform engineers, risk managers, and auditors responsible for maintaining evidence continuity during AI service disruptions. Applicable to any organization using third-party AI inference providers (OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, etc.).

1. The Evidence Gap Problem

When a major AI provider experiences a global outage, every downstream customer faces the same problem: total evidence blackout. There is no independent record of what was in-flight, what decisions were being processed, what data was exposed to the model, or how long the disruption lasted.

For organizations operating under compliance frameworks, this creates specific failures:

The core issue: Provider-side logging and monitoring fail at exactly the moment you need them most. When the provider is down, their telemetry is down with them. An independent witness layer is the only architecture that survives provider failure.

2. Anatomy of a Provider Outage

AI provider outages follow a predictable lifecycle with distinct evidence requirements at each phase:

Phase 1: Pre-Failure (Normal Operations)

Continuous witnessing establishes the baseline

Every inference call generates a witness anchor with latency, token counts, guardrail results, and model identity. This creates a verifiable record of normal operating parameters. When an outage occurs, the timestamp of the last successful anchor marks the latest moment at which the service was known to be working.

Phase 2: Failure Onset (Minutes 0-5)

Timeouts, retries, and cascading failures

API calls begin failing. In agentic architectures, autonomous agents may retry aggressively, compounding the failure. Without circuit breakers or witness-aware retry logic, agents can generate thousands of failed requests, masking the root cause and creating resource contention across the provider's infrastructure. SWT3 witness anchors record each failed attempt with its error class, enabling post-incident reconstruction of the failure cascade.

Phase 3: Full Outage (Minutes 5-N)

Complete service unavailability

No inference calls succeed. The witness chain has a verifiable gap. The absence of anchors is itself evidence: the gap between the last successful anchor and the first post-recovery anchor bounds the outage window, and when anchors are signed that bound is tamper-evident.

Phase 4: Recovery (Gradual)

Service restoration and baseline comparison

The provider reports recovery, but reported recovery and actual recovery can diverge. SWT3 drift detection (AI-DRIFT.1) compares post-recovery inference behavior against the pre-outage baseline. Latency changes, output distribution shifts, or guardrail behavior differences are detected automatically.

Phase 5: Post-Incident (Hours to Days)

Root cause analysis and regulatory reporting

The provider publishes a post-mortem. Your compliance team files incident documentation. SWT3 provides the independent evidence chain: exact timestamps, affected procedures, behavioral drift, and recovery verification anchors. This evidence is yours, not the provider's.

3. SWT3 Procedure Mapping

The following SWT3 procedures directly address evidence requirements during AI provider outages. Each procedure is mapped to its outage-phase relevance and the compliance gap it closes.

| SWT3 Procedure | Name | Outage Phase | Evidence Produced | Coverage |
| --- | --- | --- | --- | --- |
| AI-CHAIN.1 | Chain of Custody | All phases | Ordered witness chain with gap detection | Full |
| AI-CHAIN.2 | Chain Trust Level | Recovery | Trust level enforcement across handoffs | Full |
| AI-INCIDENT.1 | Incident Witnessing | Onset + Post | Timestamped incident record with classification | Full |
| AI-DRIFT.1 | Drift Detection | Recovery | Pre/post baseline comparison | Full |
| AI-ROBUST.1 | Robustness Testing | Recovery | Stress test results post-restoration | Full |
| AI-PERF.1 | Performance Monitoring | All phases | Latency, throughput, error rates | Full |
| AI-INF.1 | Inference Witnessing | Pre + Recovery | Per-call evidence with model ID and hashes | Full |
| AI-INF.2 | Latency Monitoring | Pre + Recovery | Response time baseline and deviation | Full |
| AI-REV.1 | Anchor Revocation | Post-incident | Revocation of compromised anchors | Full |
| AI-SUPPLY.1 | Supply Chain | Post-incident | Provider dependency documentation | Full |
| AI-MULTI.1 | Multi-Agent Coordination | Onset | Agent interaction chain during failure | Full |
| AI-SAFE.1 | Safety Constraints | Recovery | Guardrail verification post-restoration | Full |
| AI-AUDIT.1 | Audit Trail | All phases | Immutable compliance event log | Full |
| AI-PMM.1 | Post-Market Monitoring | Post-incident | Ongoing surveillance after recovery | Full |
| AI-CYBER.1 | Cybersecurity Measures | Onset + During | Security posture during degraded state | Partial |
| AI-TRANS.1 | Transparency Record | Post-incident | Disclosure of outage impact to stakeholders | Partial |

Coverage key: Full = SWT3 procedure directly produces outage-relevant evidence with no additional tooling. Partial = SWT3 provides supporting evidence; organizational process completes the requirement.

4. Before: Continuous Witness Chain

The value of SWT3 during an outage depends entirely on what was recorded before the outage. Every inference call wrapped with SWT3 produces a witness anchor containing:

This creates a dense, continuous chain of evidence. The chain's density determines the precision of gap detection: if you witness every call, the uncertainty at each edge of the gap is bounded by the interval between calls. If you witness only hourly, the measured gap can overstate the real outage by up to an hour at each end.

// TypeScript: Every inference call is witnessed automatically
import OpenAI from 'openai';
import { createWitness } from '@tenova/swt3-ai';

// Your own workflow identifier (placeholder value)
const sessionId = 'your-session-id';

const witness = createWitness({
  endpoint: 'https://sovereign.tenova.io/api/v1/witness',
  apiKey: 'axm_live_...',
  tenantId: 'your-tenant-id',
  clearingLevel: 1,
  agentId: 'billing-agent-prod',
  cycleId: `session-${sessionId}`,  // Links all calls in this workflow
});

// wrap() intercepts every call and mints an anchor
const client = witness.wrap(new OpenAI());
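
The density argument can be made concrete with a little arithmetic. The sketch below assumes witnessing happens at a roughly fixed interval and that you already have the two anchor timestamps on either side of the gap; `outage_bounds` is an illustrative helper, not part of the SWT3 SDK.

```python
from datetime import datetime, timedelta


def outage_bounds(last_ok: datetime, first_ok: datetime, interval: timedelta):
    """Return (shortest, longest) outage duration consistent with two anchors.

    The outage began after last_ok and ended before first_ok, so the observed
    gap is the upper bound. With one anchor every `interval`, the outage may
    have started up to one interval after last_ok and ended up to one interval
    before first_ok, which gives the lower bound.
    """
    observed = first_ok - last_ok
    shortest = max(observed - 2 * interval, timedelta(0))
    return shortest, observed


# Hourly witnessing: a three-hour gap only tells you the outage
# lasted somewhere between one and three hours.
shortest, longest = outage_bounds(
    datetime(2025, 1, 1, 10, 0),
    datetime(2025, 1, 1, 13, 0),
    interval=timedelta(hours=1),
)
```

With per-call witnessing at a few seconds between calls, the same calculation narrows the two bounds to within seconds of each other.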

5. During: Gap Detection and Incident Evidence

When the provider goes down, the chain of successful anchors stops. This absence is the evidence.

Chain gap detection (AI-CHAIN.1)

The cycle_id field links sequential calls within a workflow. When a chain has anchors at timestamps T1, T2, T3, then nothing until T7, the gap between T3 and T7 is a provable outage window. No manual logging required. The gap proves itself.
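
The T3-to-T7 example can be reproduced with a few lines of code. This is a minimal sketch, assuming the anchor timestamps for one cycle_id have already been fetched from the ledger; `find_gaps` and the `max_interval` threshold are hypothetical and are not the AI-CHAIN.1 implementation.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_interval):
    """Return (start, end) pairs where consecutive anchors are further
    apart than max_interval."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


# Anchors at T1, T2, T3 one minute apart, then nothing until T7.
t1 = datetime(2025, 1, 1, 12, 0)
chain = [
    t1,
    t1 + timedelta(minutes=1),
    t1 + timedelta(minutes=2),
    t1 + timedelta(minutes=6),
]
gaps = find_gaps(chain, max_interval=timedelta(minutes=2))
```

Here `gaps` holds a single pair, the T3 and T7 timestamps. Choosing `max_interval` is a tuning decision: it should sit comfortably above the normal spacing between calls in the workflow, or quiet periods will be reported as outages.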

Failed-call witnessing

SWT3 witnesses both successful and failed inference calls. During the onset phase, the SDK records timeout errors, HTTP 5xx responses, and connection failures as anchors with factor values indicating failure. This creates evidence of:

Sentinel daemon (continuous monitoring)

The SWT3 Sentinel daemon runs as an independent process, monitoring the witness chain for anomalies. During an outage, the Sentinel detects chain gaps in real-time and can trigger alerts or failover procedures. Because the Sentinel operates independently of the AI provider, it continues functioning when the provider is down.

6. After: Recovery Verification

The provider says they are back. How do you verify?

Drift detection (AI-DRIFT.1)

SWT3 drift detection compares post-recovery inference behavior against the pre-outage baseline. Key metrics:

# Python: Post-recovery drift check
from swt3_ai import SWT3Witness

witness = SWT3Witness(
    endpoint='https://sovereign.tenova.io/api/v1/witness',
    api_key='axm_live_...',
    tenant_id='your-tenant-id',
)

# The witness chain automatically establishes
# pre-outage baseline vs post-recovery behavior.
# Query the drift API to compare windows:
# GET /api/v1/ai-witness?model_id=gpt-4&drift=true
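
The response format of the drift API is not shown here, so the following is only a local illustration of the idea: compare a latency baseline from before the outage with samples taken after recovery. The function name, the use of the median, and the 25% tolerance are all assumptions made for the sketch, not SWT3 behavior.

```python
from statistics import median


def latency_drift(baseline_ms, recovery_ms, tolerance=0.25):
    """Compare median latency before and after an outage.

    Returns (relative_change, drifted), where relative_change is the
    fractional shift of the post-recovery median against the baseline
    median and drifted is True when that shift exceeds the tolerance.
    """
    base = median(baseline_ms)
    post = median(recovery_ms)
    change = (post - base) / base
    return change, abs(change) > tolerance


# Latency samples in milliseconds (made-up numbers for illustration).
before = [100, 110, 90, 105, 95]
after = [150, 160, 140, 155, 145]
change, drifted = latency_drift(before, after)
```

The median is used so that a handful of slow calls during warm-up does not dominate the comparison; the same pattern applies to token counts or guardrail pass rates.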

Anchor revocation (AI-REV.1)

If post-incident analysis reveals that in-flight requests during the outage onset produced corrupted or unreliable results, those anchors can be revoked:

// Revoke anchors from the failure window
witness.revoke('a1b2c3d4e5f6', 'error_correction');
// Reason codes: model_recall, policy_violation,
// data_contamination, consent_withdrawal,
// regulatory_order, error_correction, unspecified

Revocation mints an AI-REV.1 anchor that references the original fingerprint. The original anchor remains in the ledger (immutability is preserved), but verification queries return the revocation status. This is critical for audit trails: you can prove that unreliable results were identified and formally invalidated.
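
The append-only behavior described above can be modeled in a few lines. This is a toy in-memory ledger written for illustration; the `Anchor` fields and the `Ledger` class are hypothetical and do not reflect the real SWT3 ledger schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Anchor:
    fingerprint: str
    procedure: str
    revokes: Optional[str] = None  # fingerprint of the anchor being revoked
    reason: Optional[str] = None


class Ledger:
    """Append-only list of anchors: entries are never edited or removed."""

    def __init__(self):
        self._entries = []

    def __len__(self):
        return len(self._entries)

    def append(self, anchor):
        self._entries.append(anchor)

    def revoke(self, fingerprint, reason, new_fingerprint):
        # Revocation adds a new AI-REV.1 anchor that points at the original.
        self.append(Anchor(new_fingerprint, "AI-REV.1", revokes=fingerprint, reason=reason))

    def status(self, fingerprint):
        for anchor in self._entries:
            if anchor.revokes == fingerprint:
                return ("revoked", anchor.reason)
        if any(a.fingerprint == fingerprint for a in self._entries):
            return ("valid", None)
        return ("unknown", None)


ledger = Ledger()
ledger.append(Anchor("a1b2c3d4e5f6", "AI-INF.1"))
ledger.revoke("a1b2c3d4e5f6", "error_correction", new_fingerprint="f6e5d4c3b2a1")
```

After the revocation the ledger holds two entries: the original inference anchor, untouched, and the revocation anchor that changes what a status lookup returns.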

7. Provider Inheritance and Shared Responsibility

Many organizations inherit compliance claims from their AI providers. Common inherited assertions include:

An outage invalidates the availability claim immediately. But it also raises questions about the monitoring, security, and integrity claims: if the provider cannot maintain uptime, what else in their attestation chain is weaker than stated?

SA-9 (External System Services) / NIST 800-53: Control enhancement SA-9(2) requires providers of external system services to identify the functions, ports, protocols, and other services required for the use of such services. A provider outage demonstrates why inherited controls require independent verification. SWT3 provides this verification through continuous witnessing that does not depend on the provider's own monitoring infrastructure.

Independent evidence vs. inherited claims

| Claim Type | Provider Says | During Outage | SWT3 Evidence |
| --- | --- | --- | --- |
| Availability | "99.9% uptime SLA" | Unverifiable | Chain gap timestamps prove exact downtime window |
| Monitoring | "We monitor 24/7" | Provider monitoring also failed | Sentinel daemon operates independently |
| Integrity | "Model outputs are consistent" | Unknown until recovery | Drift detection compares pre/post baselines |
| Incident Response | "We notify within 72 hours" | Waiting for provider disclosure | AI-INCIDENT.1 anchor records your detection time independently |
| Recovery | "Service restored at T" | No independent verification | First successful post-outage anchor proves actual recovery time |
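
The availability row lends itself to a quick calculation. The sketch below assumes a 30-day measurement period and gap durations already derived from anchor timestamps; the function names are illustrative, and how a given SLA actually defines its measurement period is a contractual matter not covered here.

```python
from datetime import timedelta


def allowed_downtime(sla: float, period: timedelta) -> timedelta:
    """Downtime an SLA permits over a period, e.g. sla=0.999 for 99.9%."""
    return period * (1 - sla)


def measured_availability(gaps, period: timedelta) -> float:
    """Availability implied by a list of observed gap durations."""
    downtime = sum(gaps, timedelta(0))
    return 1 - downtime / period


month = timedelta(days=30)
budget = allowed_downtime(0.999, month)
observed = measured_availability([timedelta(hours=2)], month)
```

A 99.9% SLA over 30 days allows roughly 43 minutes of downtime, so a single two-hour chain gap puts measured availability near 99.72%, below the stated figure.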

8. Multi-Agent and Agentic Failures

Agentic AI architectures introduce a unique failure mode during provider outages: autonomous retry cascades.

When an AI agent encounters a provider failure, its default behavior is often to retry. In multi-agent systems where agents delegate to other agents, a single provider outage can trigger a cascade where every agent in the system simultaneously retries, compounding load on the failing provider and potentially delaying recovery for all customers.

In the worst case, a sub-agent designed to execute tasks autonomously keeps generating requests against a failing endpoint, creating an internal amplification loop. The agent does not distinguish between "the provider is slow" and "the provider is down," so it keeps retrying. With dozens or hundreds of autonomous agents doing the same, the retry traffic alone can prolong the outage.

SWT3 protections for agentic failure

Architecture note: Agentic systems without independent circuit breakers will amplify provider failures. The AI provider's rate limiting may be the only backstop, and during an outage, rate limiting infrastructure may also be degraded. SWT3 token budgets and chain enforcement provide a client-side circuit breaker that operates independently of the provider.
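
For readers unfamiliar with the pattern, here is a generic client-side circuit breaker. It is a minimal sketch of the general technique, not the SWT3 token budget or chain enforcement mechanism; the class name, the failure threshold, and the cooldown are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a failing provider after repeated errors, then allow
    calls again once a cooldown has elapsed."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self._opened_at is None:
            return True
        # After the cooldown, calls are allowed again as probes; the next
        # recorded result decides whether the breaker closes or re-opens.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

An agent would check `allow()` before each provider call and report the outcome with `record_success()` or `record_failure()`. Because the state lives in the client, it keeps working when the provider's own rate limiting is degraded.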

9. Regulatory Requirements

Multiple regulatory frameworks impose specific obligations during AI system disruptions:

EU AI Act

| Article | Requirement | SWT3 Procedure |
| --- | --- | --- |
| Art. 12(2) | Logging of system operation for traceability | AI-INF.1, AI-AUDIT.1, AI-CHAIN.1 |
| Art. 12(1) | Automatic recording of events during lifecycle | AI-CHAIN.1 (gap = recorded event) |
| Art. 15(4) | Resilience against errors, faults, inconsistencies | AI-ROBUST.1, AI-DRIFT.1 |
| Art. 73 | Serious incident reporting | AI-INCIDENT.1 |
| Art. 72 | Post-market monitoring | AI-PMM.1, AI-DRIFT.1 |

NIST AI RMF

| Function | Category | Outage Relevance | SWT3 Procedure |
| --- | --- | --- | --- |
| Govern | GV-1.3 | Incident response processes | AI-INCIDENT.1 |
| Measure | MS-2.6 | Performance monitoring | AI-PERF.1, AI-INF.2 |
| Measure | MS-2.7 | Drift and degradation tracking | AI-DRIFT.1 |
| Manage | MG-3.1 | Incident escalation | AI-INCIDENT.1, AI-AUDIT.1 |
| Manage | MG-4.1 | Post-deployment monitoring | AI-PMM.1 |

NIST 800-53

| Control | Title | Outage Relevance | SWT3 Procedure |
| --- | --- | --- | --- |
| CP-2 | Contingency Plan | AI failover documentation | AI-SUPPLY.1 |
| IR-4 | Incident Handling | Detection, analysis, containment | AI-INCIDENT.1 |
| IR-6 | Incident Reporting | Timely notification to authorities | AI-INCIDENT.1, AI-TRANS.1 |
| SA-9 | External System Services | Provider inheritance validation | AI-SUPPLY.1, AI-CHAIN.1 |
| SI-4 | System Monitoring | Independent monitoring capability | Sentinel daemon, AI-PERF.1 |
| SI-7 | Software, Firmware, and Information Integrity | Post-recovery integrity verification | AI-DRIFT.1, AI-SAFE.1 |

SR 11-7 (Model Risk Management)

For financial institutions, a provider outage affecting model inference triggers MRM obligations: the model risk function must document the disruption, assess whether model outputs during the degradation window are reliable, and validate post-recovery model behavior. SWT3 chain gaps, drift detection, and anchor revocation provide the evidence artifacts required by the MRM framework.

10. Implementation Guide

Minimum configuration for outage-resilient witnessing:

// TypeScript: Outage-resilient witness configuration
import { randomUUID } from 'node:crypto';
import { createWitness } from '@tenova/swt3-ai';

const witness = createWitness({
  endpoint: 'https://sovereign.tenova.io/api/v1/witness',
  apiKey: process.env.SWT3_API_KEY,
  tenantId: process.env.SWT3_TENANT_ID,
  clearingLevel: 1,

  // Identity: know which agent failed
  agentId: 'my-agent-prod',

  // Chain linkage: detect gaps per workflow
  cycleId: `session-${randomUUID()}`,

  // Circuit breaker: prevent runaway retries
  tokenBudget: 50000,
  chainMinTrustLevel: 2,

  // Signing: tamper-evident chain
  signingKey: process.env.SWT3_SIGNING_KEY,
  signingAlgorithm: 'hmac-sha256',

  // Flush callback: alert on anomalies
  onFlush: (payloads, receipts) => {
    // factor_a === 0 marks a failed call
    const failures = payloads.filter(p => p.factor_a === 0);
    if (failures.length > 3) {
      // alertOps is your own alerting function (not part of the SDK)
      alertOps('AI provider degradation detected', failures);
    }
  },
});

# Python: Equivalent configuration
import os
from uuid import uuid4

from swt3_ai import SWT3Witness

# alert_if_degraded is your own alerting function (not part of the SDK)
witness = SWT3Witness(
    endpoint='https://sovereign.tenova.io/api/v1/witness',
    api_key=os.environ['SWT3_API_KEY'],
    tenant_id=os.environ['SWT3_TENANT_ID'],
    clearing_level=1,
    agent_id='my-agent-prod',
    cycle_id=f'session-{uuid4()}',
    signing_key=os.environ['SWT3_SIGNING_KEY'],
    signing_algorithm='hmac-sha256',
    on_flush=lambda payloads, receipts: alert_if_degraded(payloads),
)

Post-outage checklist

  1. Query the witness ledger for the last successful anchor before the gap
  2. Query for the first successful anchor after recovery
  3. Calculate the exact outage window (anchor timestamps, not provider's claimed times)
  4. Run drift detection comparing 24-hour windows before and after
  5. Verify guardrail behavior is consistent post-recovery (AI-SAFE.1)
  6. Revoke any anchors from the degradation onset window if outputs are unreliable (AI-REV.1)
  7. Mint an AI-INCIDENT.1 anchor documenting the event
  8. Export evidence bundle for compliance records
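
Steps 1 to 3 of the checklist can be sketched as a small function. It assumes a single incident, that failed calls were witnessed, and that anchors have been exported as (timestamp, succeeded) pairs; that record shape and the `outage_window` name are hypothetical, not the ledger's query format.

```python
from datetime import datetime, timedelta


def outage_window(anchors):
    """Find the successful anchors that bracket a run of failures.

    anchors: iterable of (timestamp, succeeded) pairs.
    Returns (last_ok, first_ok, duration), or None when the records do not
    bracket an outage (no failures, or no success on one side).
    """
    ok = sorted(ts for ts, succeeded in anchors if succeeded)
    failed = sorted(ts for ts, succeeded in anchors if not succeeded)
    if not failed:
        return None
    before = [ts for ts in ok if ts < failed[0]]
    after = [ts for ts in ok if ts > failed[-1]]
    if not before or not after:
        return None
    last_ok, first_ok = before[-1], after[0]
    return last_ok, first_ok, first_ok - last_ok


start = datetime(2025, 1, 1, 12, 0)
records = [
    (start, True),
    (start + timedelta(minutes=1), True),
    (start + timedelta(minutes=2), False),
    (start + timedelta(minutes=3), False),
    (start + timedelta(minutes=30), True),
    (start + timedelta(minutes=31), True),
]
window = outage_window(records)
```

The returned duration is the figure to put in the incident record, alongside the provider's claimed times for comparison.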

11. Clearing Levels During Incidents

During an incident, organizations may need to share more operational detail with regulators or incident response teams than normal operations would permit. SWT3 clearing levels accommodate this:

| Level | Name | Normal Use | During Incident |
| --- | --- | --- | --- |
| 0 | Analytics | Internal R&D | Full detail for internal incident team |
| 1 | Standard | Production | Regulator-appropriate detail level |
| 2 | Sensitive | PII-adjacent | Third-party responders (limited model info) |
| 3 | Classified | Sovereign | Cross-agency notification (factors only) |

At all clearing levels, the chain gap evidence is preserved. Even at Level 3 (Classified), the timestamp and factor data prove the outage window without exposing model identity, provider, or operational context. This allows classified environments to report incidents up the chain of command without declassifying operational details.

Neutrality statement: TeNova Axiom is an independent evidence platform. It does not grant certifications, assign fault for outages, or replace incident response processes. SWT3 witness anchors record what happened and when. The interpretation of that evidence, the determination of fault, and the regulatory response are the responsibility of the deploying organization and its designated authorities.