Contents
1. The Evidence Gap Problem
2. Anatomy of a Provider Outage
3. SWT3 Procedure Mapping
4. Before: Continuous Witness Chain
5. During: Gap Detection and Incident Evidence
6. After: Recovery Verification
7. Provider Inheritance and Shared Responsibility
8. Multi-Agent and Agentic Failures
9. Regulatory Requirements
10. Implementation Guide
11. Clearing Levels During Incidents

Audience: AI compliance officers, platform engineers, risk managers, and auditors responsible for maintaining evidence continuity during AI service disruptions. Applicable to any organization using third-party AI inference providers (OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, etc.).
1. The Evidence Gap Problem
When a major AI provider experiences a global outage, every downstream customer faces the same problem: total evidence blackout. There is no independent record of what was in-flight, what decisions were being processed, what data was exposed to the model, or how long the disruption lasted.
For organizations operating under compliance frameworks, this creates specific failures:
- No proof of gap duration. Without independent timestamps, you cannot demonstrate to an auditor exactly when AI processing stopped and resumed.
- No evidence of in-flight state. Requests that were submitted but never completed leave no trace. In agentic systems, partially executed tool chains vanish.
- No drift baseline comparison. Post-recovery, you cannot verify that model behavior returned to its pre-outage baseline without an independent reference point.
- No incident documentation. Regulatory frameworks require documented incident response. Without witnessing infrastructure, the incident report relies entirely on the provider's disclosure timeline.
- Inherited compliance claims collapse. If your security posture inherits availability, integrity, or monitoring claims from the AI provider, an outage invalidates those inheritance assumptions until independently verified.
2. Anatomy of a Provider Outage
AI provider outages follow a predictable lifecycle with distinct evidence requirements at each phase:
Continuous witnessing establishes the baseline
Every inference call generates a witness anchor with latency, token counts, guardrail results, and model identity. This creates a verifiable record of normal operating parameters. When an outage occurs, the last successful anchor timestamp marks the exact moment of service loss.
Timeouts, retries, and cascading failures
API calls begin failing. In agentic architectures, autonomous agents may retry aggressively, compounding the failure. Without circuit breakers or witness-aware retry logic, agents can generate thousands of failed requests, masking the root cause and creating resource contention across the provider's infrastructure. SWT3 witness anchors record each failed attempt with its error class, enabling post-incident reconstruction of the failure cascade.
Complete service unavailability
No inference calls succeed. The witness chain has a verifiable gap. The absence of anchors is itself evidence: the gap between the last successful anchor and the first post-recovery anchor bounds the outage window to the millisecond precision of the anchor timestamps, using records you hold rather than the provider's status page.
Service restoration and baseline comparison
The provider reports recovery, but reported recovery and actual recovery can diverge. SWT3 drift detection (AI-DRIFT.1) compares post-recovery inference behavior against the pre-outage baseline. Latency changes, output distribution shifts, or guardrail behavior differences are detected automatically.
Root cause analysis and regulatory reporting
The provider publishes a post-mortem. Your compliance team files incident documentation. SWT3 provides the independent evidence chain: exact timestamps, affected procedures, behavioral drift, and recovery verification anchors. This evidence is yours, not the provider's.
3. SWT3 Procedure Mapping
The following SWT3 procedures directly address evidence requirements during AI provider outages. Each procedure is mapped to its outage-phase relevance and the compliance gap it closes.
| SWT3 Procedure | Name | Outage Phase | Evidence Produced | Coverage |
|---|---|---|---|---|
| AI-CHAIN.1 | Chain of Custody | All phases | Ordered witness chain with gap detection | Full |
| AI-CHAIN.2 | Chain Trust Level | Recovery | Trust level enforcement across handoffs | Full |
| AI-INCIDENT.1 | Incident Witnessing | Onset + Post | Timestamped incident record with classification | Full |
| AI-DRIFT.1 | Drift Detection | Recovery | Pre/post baseline comparison | Full |
| AI-ROBUST.1 | Robustness Testing | Recovery | Stress test results post-restoration | Full |
| AI-PERF.1 | Performance Monitoring | All phases | Latency, throughput, error rates | Full |
| AI-INF.1 | Inference Witnessing | Pre + Recovery | Per-call evidence with model ID and hashes | Full |
| AI-INF.2 | Latency Monitoring | Pre + Recovery | Response time baseline and deviation | Full |
| AI-REV.1 | Anchor Revocation | Post-incident | Revocation of compromised anchors | Full |
| AI-SUPPLY.1 | Supply Chain | Post-incident | Provider dependency documentation | Full |
| AI-MULTI.1 | Multi-Agent Coordination | Onset | Agent interaction chain during failure | Full |
| AI-SAFE.1 | Safety Constraints | Recovery | Guardrail verification post-restoration | Full |
| AI-AUDIT.1 | Audit Trail | All phases | Immutable compliance event log | Full |
| AI-PMM.1 | Post-Market Monitoring | Post-incident | Ongoing surveillance after recovery | Full |
| AI-CYBER.1 | Cybersecurity Measures | Onset + During | Security posture during degraded state | Partial |
| AI-TRANS.1 | Transparency Record | Post-incident | Disclosure of outage impact to stakeholders | Partial |
Coverage key: Full = SWT3 procedure directly produces outage-relevant evidence with no additional tooling. Partial = SWT3 provides supporting evidence; organizational process completes the requirement.
4. Before: Continuous Witness Chain
The value of SWT3 during an outage depends entirely on what was recorded before the outage. Every inference call wrapped with SWT3 produces a witness anchor containing:
- Cryptographic hash of the prompt and response (content never leaves your infrastructure)
- Model identifier and provider
- Latency in milliseconds
- Input and output token counts
- Guardrail pass/fail count and guardrail names
- Millisecond-precision timestamp in the fingerprint formula
- Chain linkage via cycle_id for multi-step workflows
This creates a dense, continuous chain of evidence. The chain's density determines the precision of gap detection: if you witness every call, the gap window is bounded by your average call interval. If you witness hourly, the gap could be up to 60 minutes wider than reality.
```typescript
// TypeScript: Every inference call is witnessed automatically
import OpenAI from 'openai';
import { createWitness } from '@tenova/swt3-ai';

// sessionId is your application's own workflow or session identifier
const witness = createWitness({
  endpoint: 'https://sovereign.tenova.io/api/v1/witness',
  apiKey: 'axm_live_...',
  tenantId: 'your-tenant-id',
  clearingLevel: 1,
  agentId: 'billing-agent-prod',
  cycleId: `session-${sessionId}`, // Links all calls in this workflow
});

// wrap() intercepts every call and mints an anchor
const client = witness.wrap(new OpenAI());
```
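To make the anchor contents listed above concrete, here is a minimal Python sketch of a record that stores only digests of the prompt and response. The field names, the `mint_anchor` helper, and the fingerprint formula are assumptions for illustration; they are not the SWT3 wire format or its actual fingerprint formula.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass


def sha256_hex(text: str) -> str:
    """Hash content locally so only the digest, never the text, is stored."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class WitnessAnchor:
    # Hypothetical field names, chosen to mirror the list above.
    cycle_id: str
    model_id: str
    provider: str
    prompt_hash: str
    response_hash: str
    latency_ms: int
    input_tokens: int
    output_tokens: int
    guardrails_passed: int
    guardrails_failed: int
    timestamp_ms: int

    def fingerprint(self) -> str:
        """Digest over all fields in a stable key order (an assumed formula)."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return sha256_hex(canonical)


def mint_anchor(cycle_id, model_id, provider, prompt, response, latency_ms,
                input_tokens, output_tokens, guardrails_passed=0,
                guardrails_failed=0, timestamp_ms=None):
    """Build an anchor for one completed inference call."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return WitnessAnchor(
        cycle_id, model_id, provider, sha256_hex(prompt), sha256_hex(response),
        latency_ms, input_tokens, output_tokens, guardrails_passed,
        guardrails_failed, timestamp_ms,
    )


anchor = mint_anchor(
    "session-42", "gpt-4", "openai", "What is my balance?", "Your balance is $10.",
    latency_ms=812, input_tokens=6, output_tokens=7, guardrails_passed=2,
    timestamp_ms=1_700_000_000_000,
)
```

Because the timestamp is part of the hashed record, changing it changes the fingerprint, which is what makes the chain's timing tamper-evident in this sketch.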
5. During: Gap Detection and Incident Evidence
When the provider goes down, the witness chain stops. This absence is the evidence.
Chain gap detection (AI-CHAIN.1)
The cycle_id field links sequential calls within a workflow. When a chain has anchors at timestamps T1, T2, T3, then nothing until T7, the gap between T3 and T7 is a provable outage window. No manual logging required. The gap proves itself.
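The T3-to-T7 reasoning can be sketched in a few lines of Python. This is an illustration over plain `(cycle_id, timestamp_ms)` pairs, not the SWT3 implementation; `find_gaps` and `max_interval_ms` are hypothetical names, and the threshold is something you would tune to your normal call interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    cycle_id: str
    last_before_ms: int   # last anchor seen before the silence
    first_after_ms: int   # first anchor seen after the silence

    @property
    def duration_ms(self) -> int:
        return self.first_after_ms - self.last_before_ms


def find_gaps(anchors, max_interval_ms):
    """Return gaps where consecutive anchors in one chain are too far apart.

    `anchors` is an iterable of (cycle_id, timestamp_ms) pairs in any order.
    """
    by_chain = {}
    for cycle_id, timestamp_ms in anchors:
        by_chain.setdefault(cycle_id, []).append(timestamp_ms)

    gaps = []
    for cycle_id, stamps in by_chain.items():
        stamps.sort()
        for earlier, later in zip(stamps, stamps[1:]):
            if later - earlier > max_interval_ms:
                gaps.append(Gap(cycle_id, earlier, later))
    return gaps


# Anchors at T1, T2, T3, then nothing until T7 (one tick = 1000 ms).
chain = [("session-1", 1000), ("session-1", 2000),
         ("session-1", 3000), ("session-1", 7000)]
gaps = find_gaps(chain, max_interval_ms=1500)
# gaps == [Gap(cycle_id='session-1', last_before_ms=3000, first_after_ms=7000)]
```

The detected gap is an upper bound on the outage: the true disruption began some time after the last anchor and ended some time before the first post-recovery anchor.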
Failed-call witnessing
SWT3 witnesses both successful and failed inference calls. During the onset phase, the SDK records timeout errors, HTTP 5xx responses, and connection failures as anchors with factor values indicating failure. This creates evidence of:
- Exactly when failures began (first failed anchor timestamp)
- Retry behavior and frequency (anchor count during onset phase)
- Error classification (timeout vs. server error vs. connection refused)
- Which agents or workflows were affected (filtered by agent_id)
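The four kinds of evidence above can be derived from failed anchors with simple aggregation. The following Python sketch assumes each failed anchor is a dict with `timestamp_ms`, `agent_id`, and `error_class` keys; those key names and the error-class labels are illustrative assumptions, not SDK output.

```python
import socket
from collections import Counter


def classify_failure(exc=None, status_code=None):
    """Map an exception or HTTP status code to a coarse error class."""
    if status_code is not None and 500 <= status_code <= 599:
        return "server_error"
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    # ConnectionRefusedError is a subclass of ConnectionError, so test it first.
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    if isinstance(exc, ConnectionError):
        return "connection_error"
    return "unknown"


def summarize_onset(failed_anchors):
    """Summarize when failures began, how many there were, and who was hit."""
    if not failed_anchors:
        return None
    return {
        "first_failure_ms": min(a["timestamp_ms"] for a in failed_anchors),
        "attempts": len(failed_anchors),
        "by_error_class": dict(Counter(a["error_class"] for a in failed_anchors)),
        "by_agent": dict(Counter(a["agent_id"] for a in failed_anchors)),
    }


failed = [
    {"timestamp_ms": 5000, "agent_id": "billing",
     "error_class": classify_failure(exc=TimeoutError())},
    {"timestamp_ms": 5200, "agent_id": "billing",
     "error_class": classify_failure(status_code=503)},
    {"timestamp_ms": 5100, "agent_id": "support",
     "error_class": classify_failure(exc=ConnectionRefusedError())},
]
summary = summarize_onset(failed)
```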
Sentinel daemon (continuous monitoring)
The SWT3 Sentinel daemon runs as an independent process, monitoring the witness chain for anomalies. During an outage, the Sentinel detects chain gaps in real-time and can trigger alerts or failover procedures. Because the Sentinel operates independently of the AI provider, it continues functioning when the provider is down.
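The core idea of independent, real-time gap detection can be illustrated with a small watchdog that flags any chain that has been silent for too long. This is a sketch of the concept only, not the Sentinel daemon; `ChainWatchdog` and `max_silence_s` are hypothetical names.

```python
import time


class ChainWatchdog:
    """Flags a chain as stale when no anchor has arrived within a deadline."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # cycle_id -> time of most recent anchor

    def record_anchor(self, cycle_id):
        """Call this whenever an anchor is minted for the given chain."""
        self._last_seen[cycle_id] = self._clock()

    def stale_chains(self):
        """Return the chains whose last anchor is older than the deadline."""
        now = self._clock()
        return sorted(
            cycle_id
            for cycle_id, seen in self._last_seen.items()
            if now - seen > self.max_silence_s
        )


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
watchdog = ChainWatchdog(max_silence_s=30, clock=lambda: fake_now[0])
watchdog.record_anchor("session-a")
fake_now[0] = 20.0
watchdog.record_anchor("session-b")
fake_now[0] = 40.0
stale = watchdog.stale_chains()   # only session-a has been silent for over 30 s
```

Silence alone does not prove an outage, since a workflow may simply be idle. In practice a check like this would be combined with failed-call anchors before triggering failover.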
6. After: Recovery Verification
The provider says they are back. How do you verify?
Drift detection (AI-DRIFT.1)
SWT3 drift detection compares post-recovery inference behavior against the pre-outage baseline. Key metrics:
- Latency distribution: Has the P50/P95/P99 shifted? Degraded performance after recovery is common and may indicate the provider restored service on reduced capacity.
- Output distribution: Are response patterns consistent? A provider recovering from an outage may fail over to a different model version, checkpoint, or data center.
- Guardrail behavior: Do the same inputs trigger the same guardrails? Changes here indicate potential model substitution or configuration drift during recovery.
- Token counts: Unusual output length changes can signal model version differences.
```python
# Python: Post-recovery drift check
from swt3_ai import SWT3Witness

witness = SWT3Witness(
    endpoint='https://sovereign.tenova.io/api/v1/witness',
    api_key='axm_live_...',
    tenant_id='your-tenant-id',
)

# The witness chain automatically establishes
# pre-outage baseline vs post-recovery behavior.
# Query the drift API to compare windows:
# GET /api/v1/ai-witness?model_id=gpt-4&drift=true
```
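For teams that want to sanity-check the latency metric locally, the P50/P95/P99 comparison can be approximated with the standard library. This sketch is independent of the drift API; the function names and the 20% tolerance are assumptions, and it expects non-zero baseline latencies.

```python
import statistics


def percentiles(samples, points=(50, 95, 99)):
    """Return the requested percentiles of a list of numeric samples."""
    # quantiles(n=100) yields 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


def latency_drift(baseline_ms, recovery_ms, tolerance=0.20):
    """Compare percentiles and flag any that moved by more than `tolerance`."""
    before = percentiles(baseline_ms)
    after = percentiles(recovery_ms)
    report = {}
    for p in before:
        change = (after[p] - before[p]) / before[p]
        report[f"p{p}"] = {
            "before_ms": before[p],
            "after_ms": after[p],
            "relative_change": change,
            "drifted": abs(change) > tolerance,
        }
    return report


# Synthetic data: post-recovery latencies are 50% slower across the board.
baseline = list(range(100, 200))
recovery = [latency * 1.5 for latency in baseline]
report = latency_drift(baseline, recovery)
```

The same shape works for output token counts. Comparing output distributions or guardrail behavior needs replayed reference inputs, which this sketch does not cover.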
Anchor revocation (AI-REV.1)
If post-incident analysis reveals that in-flight requests during the outage onset produced corrupted or unreliable results, those anchors can be revoked:
```typescript
// Revoke anchors from the failure window
witness.revoke('a1b2c3d4e5f6', 'error_correction');

// Reason codes: model_recall, policy_violation,
// data_contamination, consent_withdrawal,
// regulatory_order, error_correction, unspecified
```
Revocation mints an AI-REV.1 anchor that references the original fingerprint. The original anchor remains in the ledger (immutability is preserved), but verification queries return the revocation status. This is critical for audit trails: you can prove that unreliable results were identified and formally invalidated.
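The revoke-by-appending pattern described above can be modeled as an append-only ledger. The sketch below is a toy in-memory model, not the SWT3 ledger; the `Ledger` class, its entry layout, and the `rev-` fingerprint prefix are hypothetical, while the reason codes are the ones listed above.

```python
import time

REASON_CODES = {
    "model_recall", "policy_violation", "data_contamination",
    "consent_withdrawal", "regulatory_order", "error_correction", "unspecified",
}


class Ledger:
    """Append-only list of entries: revocation adds an entry, never deletes one."""

    def __init__(self):
        self._entries = []

    def append(self, fingerprint, procedure, **fields):
        entry = {"fingerprint": fingerprint, "procedure": procedure, **fields}
        self._entries.append(entry)
        return entry

    def revoke(self, fingerprint, reason, timestamp_ms=None):
        """Mint a revocation entry that references the original fingerprint."""
        if reason not in REASON_CODES:
            raise ValueError(f"unknown reason code: {reason}")
        if not any(e["fingerprint"] == fingerprint for e in self._entries):
            raise KeyError(fingerprint)
        if timestamp_ms is None:
            timestamp_ms = int(time.time() * 1000)
        return self.append(
            f"rev-{fingerprint}", "AI-REV.1",
            revokes=fingerprint, reason=reason, timestamp_ms=timestamp_ms,
        )

    def status(self, fingerprint):
        """Report whether any later entry revokes the given fingerprint."""
        for entry in self._entries:
            if entry.get("revokes") == fingerprint:
                return {"revoked": True, "reason": entry["reason"]}
        return {"revoked": False, "reason": None}

    def entries(self):
        return list(self._entries)


ledger = Ledger()
ledger.append("a1b2c3d4e5f6", "AI-INF.1")
ledger.revoke("a1b2c3d4e5f6", "error_correction", timestamp_ms=1_700_000_000_000)
```

After the revocation, the original entry is still present and `ledger.status("a1b2c3d4e5f6")` reports it as revoked, which is the property an auditor needs.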
7. Provider Inheritance and Shared Responsibility
Many organizations inherit compliance claims from their AI providers. Common inherited assertions include:
- "Our AI provider maintains 99.9% uptime" (availability)
- "The provider monitors model performance continuously" (monitoring)
- "The provider encrypts data in transit and at rest" (data protection)
- "The provider's infrastructure is SOC 2 Type II certified" (security)
An outage invalidates the availability claim immediately. But it also raises questions about the monitoring, security, and integrity claims: if the provider cannot maintain uptime, what else in their attestation chain is weaker than stated?
Independent evidence vs. inherited claims
| Claim Type | Provider Says | During Outage | SWT3 Evidence |
|---|---|---|---|
| Availability | "99.9% uptime SLA" | Unverifiable | Chain gap timestamps prove exact downtime window |
| Monitoring | "We monitor 24/7" | Provider monitoring also failed | Sentinel daemon operates independently |
| Integrity | "Model outputs are consistent" | Unknown until recovery | Drift detection compares pre/post baselines |
| Incident Response | "We notify within 72 hours" | Waiting for provider disclosure | AI-INCIDENT.1 anchor records your detection time independently |
| Recovery | "Service restored at T" | No independent verification | First successful post-outage anchor proves actual recovery time |
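As a worked example of the availability row, chain gap timestamps can be turned into a measured availability figure and compared with the provider's SLA. The sketch assumes a 30-day measurement period and counts every gap as downtime; real SLAs define their own periods and exclusions, so treat the function names and defaults here as illustrative.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Downtime budget implied by an uptime percentage over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - sla_percent) / 100


def measured_availability(gaps_ms, period_days=30):
    """Availability percentage given (start_ms, end_ms) outage windows."""
    period_ms = period_days * 24 * 60 * 60 * 1000
    downtime_ms = sum(end - start for start, end in gaps_ms)
    return 100 * (1 - downtime_ms / period_ms)


def sla_breached(gaps_ms, sla_percent, period_days=30):
    """True when measured availability falls below the SLA percentage."""
    return measured_availability(gaps_ms, period_days) < sla_percent


budget = allowed_downtime_minutes(99.9)            # about 43.2 minutes per 30 days
two_hour_outage = [(0, 2 * 60 * 60 * 1000)]
breached = sla_breached(two_hour_outage, 99.9)     # a 2-hour gap exceeds the budget
```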
8. Multi-Agent and Agentic Failures
Agentic AI architectures introduce a unique failure mode during provider outages: autonomous retry cascades.
When an AI agent encounters a provider failure, its default behavior is often to retry. In multi-agent systems where agents delegate to other agents, a single provider outage can trigger a cascade where every agent in the system simultaneously retries, compounding load on the failing provider and potentially delaying recovery for all customers.
In the worst case, a sub-agent designed to execute tasks autonomously continues generating requests against a failing endpoint, effectively creating an internal amplification loop. The agent does not distinguish between "the provider is slow" and "the provider is down," so it keeps trying with increasing urgency. At scale, dozens or hundreds of autonomous agents retrying simultaneously can compound the original failure.
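One common mitigation for this amplification loop, shown here as a general technique rather than an SWT3 feature, is capped exponential backoff with jitter and a hard attempt limit. The `call_with_backoff` helper and its defaults are assumptions for illustration.

```python
import random
import time


def call_with_backoff(call, max_attempts=5, base_s=0.5, cap_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` with capped exponential backoff and full jitter.

    After `max_attempts` failures the last error is re-raised, so the caller
    can treat the provider as down instead of retrying forever.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise
            # Random delay in [0, ceiling) spreads retries from many agents apart.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * ceiling)


# Usage: a stand-in provider call that fails twice, then succeeds.
attempts = []


def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("provider timed out")
    return "ok"


delays = []
result = call_with_backoff(flaky_call, sleep=delays.append)
```

The jitter matters as much as the backoff: without it, agents that failed at the same moment retry at the same moment, which recreates the synchronized load spike.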
SWT3 protections for agentic failure
- AI-MULTI.1 (Multi-Agent Coordination): Witnesses inter-agent handoffs. During an outage, the coordination chain shows exactly which agent triggered the retry cascade and how it propagated.
- AI-CHAIN.1 + cycle_id: Links all calls within an agentic workflow. Post-incident, you can reconstruct the full cascade: which agent called which, how many retries occurred, and where the circuit should have broken.
- ChainEnforcer (strict mode): The SWT3 ChainEnforcer can be configured with a chainMinTrustLevel. When trust level drops below the threshold (e.g., repeated failures lower the effective trust), the enforcer blocks further calls rather than allowing unbounded retries.
- Token budget: The tokenBudget configuration limits total tokens consumed per chain. A runaway agent hitting a failing endpoint still consumes tokens for each attempt. When the budget is exhausted, the chain stops, preventing infinite retry loops.
- AI-SAFE.1 (Safety Constraints): Guardrail verification ensures that post-recovery agents re-establish safety boundaries before resuming autonomous operation.
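The two stop conditions above, a token budget and a failure threshold, can be sketched as a small guard that an agent consults before each call. This is a generic illustration and not the ChainEnforcer itself; `ChainGuard`, `ChainHalted`, and the use of consecutive failures as a stand-in for trust level are assumptions.

```python
class ChainHalted(RuntimeError):
    """Raised when the guard refuses to allow another call in this chain."""


class ChainGuard:
    def __init__(self, token_budget, max_consecutive_failures):
        self.token_budget = token_budget
        self.max_consecutive_failures = max_consecutive_failures
        self.tokens_used = 0
        self.consecutive_failures = 0

    def before_call(self, estimated_tokens):
        """Check both limits; raise instead of letting the call proceed."""
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise ChainHalted("too many consecutive failures")
        if self.tokens_used + estimated_tokens > self.token_budget:
            raise ChainHalted("token budget exhausted")

    def record(self, tokens, ok):
        """Account for a finished attempt, successful or not."""
        self.tokens_used += tokens
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


# Usage: a runaway agent is stopped after three failed attempts.
guard = ChainGuard(token_budget=50_000, max_consecutive_failures=3)
attempts_made = 0
halt_reason = None
try:
    while True:
        guard.before_call(estimated_tokens=1_000)
        attempts_made += 1
        guard.record(tokens=1_000, ok=False)   # every attempt fails during the outage
except ChainHalted as halted:
    halt_reason = str(halted)
```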
9. Regulatory Requirements
Multiple regulatory frameworks impose specific obligations during AI system disruptions:
EU AI Act
| Article | Requirement | SWT3 Procedure |
|---|---|---|
| Art. 9(8) | Logging of system operation for traceability | AI-INF.1, AI-AUDIT.1, AI-CHAIN.1 |
| Art. 12 | Automatic recording of events during lifecycle | AI-CHAIN.1 (gap = recorded event) |
| Art. 15(4) | Resilience against errors, faults, inconsistencies | AI-ROBUST.1, AI-DRIFT.1 |
| Art. 73 | Serious incident reporting | AI-INCIDENT.1 |
| Art. 72 | Post-market monitoring | AI-PMM.1, AI-DRIFT.1 |
NIST AI RMF
| Function | Category | Outage Relevance | SWT3 Procedure |
|---|---|---|---|
| Govern | GV-1.3 | Incident response processes | AI-INCIDENT.1 |
| Measure | MS-2.6 | Performance monitoring | AI-PERF.1, AI-INF.2 |
| Measure | MS-2.7 | Drift and degradation tracking | AI-DRIFT.1 |
| Manage | MG-3.1 | Incident escalation | AI-INCIDENT.1, AI-AUDIT.1 |
| Manage | MG-4.1 | Post-deployment monitoring | AI-PMM.1 |
NIST 800-53
| Control | Title | Outage Relevance | SWT3 Procedure |
|---|---|---|---|
| CP-2 | Contingency Plan | AI failover documentation | AI-SUPPLY.1 |
| IR-4 | Incident Handling | Detection, analysis, containment | AI-INCIDENT.1 |
| IR-6 | Incident Reporting | Timely notification to authorities | AI-INCIDENT.1, AI-TRANS.1 |
| SA-9 | External System Services | Provider inheritance validation | AI-SUPPLY.1, AI-CHAIN.1 |
| SI-4 | System Monitoring | Independent monitoring capability | Sentinel daemon, AI-PERF.1 |
| SI-7 | Software/Info Integrity | Post-recovery integrity verification | AI-DRIFT.1, AI-SAFE.1 |
SR 11-7 (Model Risk Management)
For financial institutions, a provider outage affecting model inference triggers MRM obligations: the model risk function must document the disruption, assess whether model outputs during the degradation window are reliable, and validate post-recovery model behavior. SWT3 chain gaps, drift detection, and anchor revocation provide the evidence artifacts required by the MRM framework.
10. Implementation Guide
Minimum configuration for outage-resilient witnessing:
```typescript
// TypeScript: Outage-resilient witness configuration
import { createWitness } from '@tenova/swt3-ai';

const witness = createWitness({
  endpoint: 'https://sovereign.tenova.io/api/v1/witness',
  apiKey: process.env.SWT3_API_KEY,
  tenantId: process.env.SWT3_TENANT_ID,
  clearingLevel: 1,
  // Identity: know which agent failed
  agentId: 'my-agent-prod',
  // Chain linkage: detect gaps per workflow
  cycleId: `session-${crypto.randomUUID()}`,
  // Circuit breaker: prevent runaway retries
  tokenBudget: 50000,
  chainMinTrustLevel: 2,
  // Signing: tamper-evident chain
  signingKey: process.env.SWT3_SIGNING_KEY,
  signingAlgorithm: 'hmac-sha256',
  // Flush callback: alert on anomalies
  onFlush: (payloads, receipts) => {
    const failures = payloads.filter(p => p.factor_a === 0);
    if (failures.length > 3) {
      // alertOps is your own alerting hook, not part of the SDK
      alertOps('AI provider degradation detected', failures);
    }
  },
});
```
```python
# Python: Equivalent configuration
import os
from uuid import uuid4

from swt3_ai import SWT3Witness

witness = SWT3Witness(
    endpoint='https://sovereign.tenova.io/api/v1/witness',
    api_key=os.environ['SWT3_API_KEY'],
    tenant_id=os.environ['SWT3_TENANT_ID'],
    clearing_level=1,
    agent_id='my-agent-prod',
    cycle_id=f'session-{uuid4()}',
    signing_key=os.environ['SWT3_SIGNING_KEY'],
    signing_algorithm='hmac-sha256',
    # alert_if_degraded is your own helper, not part of the SDK
    on_flush=lambda payloads, receipts: alert_if_degraded(payloads),
)
```
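The Python configuration calls an `alert_if_degraded` helper that the application must supply. A minimal sketch follows, mirroring the TypeScript callback's rule of alerting when more than three payloads in a flush have `factor_a` equal to 0. It assumes payloads are dicts, and the `threshold` and `notify` parameters are hypothetical additions for testability.

```python
def alert_if_degraded(payloads, threshold=3, notify=print):
    """Alert when a flush contains more than `threshold` failed calls.

    Assumes each payload is a dict in which factor_a == 0 marks a failed
    call, as in the TypeScript onFlush example. Returns True if alerted.
    """
    failures = [p for p in payloads if p.get("factor_a") == 0]
    if len(failures) > threshold:
        notify(f"AI provider degradation detected: {len(failures)} failed calls")
        return True
    return False


# Usage: four failures in one flush crosses the default threshold of three.
messages = []
flush = [{"factor_a": 0}] * 4 + [{"factor_a": 1}] * 6
alerted = alert_if_degraded(flush, notify=messages.append)
```

In production, `notify` would be replaced with a pager or chat integration. Keeping it injectable lets the threshold logic be tested without sending real alerts.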
Post-outage checklist
- Query the witness ledger for the last successful anchor before the gap
- Query for the first successful anchor after recovery
- Calculate the exact outage window (anchor timestamps, not provider's claimed times)
- Run drift detection comparing 24-hour windows before and after
- Verify guardrail behavior is consistent post-recovery (AI-SAFE.1)
- Revoke any anchors from the degradation onset window if outputs are unreliable (AI-REV.1)
- Mint an AI-INCIDENT.1 anchor documenting the event
- Export evidence bundle for compliance records
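The first three checklist items and the final export step can be sketched together. This example works on plain dicts with `timestamp_ms` and `ok` keys rather than on a real ledger query; the function names, the bundle layout, and the `provider_claim` field are assumptions for illustration.

```python
import hashlib
import json


def outage_window(anchors):
    """Find the widest stretch between two consecutive successful anchors."""
    successes = sorted(a["timestamp_ms"] for a in anchors if a["ok"])
    if len(successes) < 2:
        return None
    start, end = max(zip(successes, successes[1:]),
                     key=lambda pair: pair[1] - pair[0])
    return {"last_success_ms": start, "first_recovery_ms": end,
            "duration_ms": end - start}


def evidence_bundle(anchors, provider_claim=None):
    """Package the measured window with a digest so later edits are detectable."""
    body = {
        "window": outage_window(anchors),
        "provider_claim": provider_claim,
        "anchor_count": len(anchors),
    }
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return {**body, "sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest()}


anchors = [
    {"timestamp_ms": 1000, "ok": True},
    {"timestamp_ms": 2000, "ok": True},
    {"timestamp_ms": 2500, "ok": False},   # failed call during onset
    {"timestamp_ms": 9000, "ok": True},    # first success after recovery
    {"timestamp_ms": 9500, "ok": True},
]
bundle = evidence_bundle(anchors, provider_claim={"start_ms": 3000, "end_ms": 8000})
```

Recording the provider's claimed times next to the measured window makes any discrepancy between the two explicit in the compliance record.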
11. Clearing Levels During Incidents
During an incident, organizations may need to share more operational detail with regulators or incident response teams than normal operations would permit. SWT3 clearing levels accommodate this:
| Level | Name | Normal Use | During Incident |
|---|---|---|---|
| 0 | Analytics | Internal R&D | Full detail for internal incident team |
| 1 | Standard | Production | Regulator-appropriate detail level |
| 2 | Sensitive | PII-adjacent | Third-party responders (limited model info) |
| 3 | Classified | Sovereign | Cross-agency notification (factors only) |
At all clearing levels, the chain gap evidence is preserved. Even at Level 3 (Classified), the timestamp and factor data prove the outage window without exposing model identity, provider, or operational context. This allows classified environments to report incidents up the chain of command without declassifying operational details.
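The idea that higher clearing levels expose fewer fields while preserving the gap evidence can be sketched as a field projection. The field names and the exact set visible at each level below are assumptions made for illustration. Only the Level 3 rule (timestamp and factor data, with no model identity, provider, or operational context) comes from the description above.

```python
# Hypothetical mapping from clearing level to visible fields.
# None means every field is kept.
VISIBLE_FIELDS = {
    0: None,
    1: {"timestamp_ms", "factor_a", "cycle_id", "agent_id",
        "model_id", "provider", "latency_ms"},
    2: {"timestamp_ms", "factor_a", "cycle_id", "model_id"},
    3: {"timestamp_ms", "factor_a"},
}


def redact(anchor, clearing_level):
    """Return a copy of the anchor with only the fields allowed at this level."""
    if clearing_level not in VISIBLE_FIELDS:
        raise ValueError(f"unknown clearing level: {clearing_level}")
    allowed = VISIBLE_FIELDS[clearing_level]
    if allowed is None:
        return dict(anchor)
    return {key: value for key, value in anchor.items() if key in allowed}


anchor = {
    "timestamp_ms": 1_700_000_000_000, "factor_a": 0, "cycle_id": "session-9",
    "agent_id": "billing-agent-prod", "model_id": "gpt-4", "provider": "openai",
    "latency_ms": 30_000, "prompt_hash": "ab12",
}
classified_view = redact(anchor, clearing_level=3)
# classified_view == {"timestamp_ms": 1700000000000, "factor_a": 0}
```

Because the timestamp survives at every level, gap detection over redacted anchors yields the same outage window as over the full records.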