This section goes beyond listing metrics. Assessors (Notified Bodies under Article 43) evaluate whether the metrics are fit for purpose, not just whether they exist. A latency SLA is appropriate for real-time fraud detection but irrelevant for a batch classification system. Demographic parity is appropriate for hiring but may be inappropriate for medical triage.
Axiom auto-populates Annex IV Section 4 evidence from the following AI procedures. The AUTO tag indicates that the SWT3 SDK generates this evidence without additional deployer action. The MANUAL tag indicates evidence that requires deployer input.
| SWT3 Procedure | Metric Domain | What It Proves | Source |
|---|---|---|---|
| AI-INF.2 | Latency / Response Time | Model responds within defined SLA threshold. Factor_a = threshold (ms), factor_b = actual (ms). | AUTO |
| AI-INF.3 | Throughput / Volume | System stays within capacity governance limits. Factor_a = limit (req/min), factor_b = actual. | AUTO |
| AI-MDL.3 | Accuracy / Drift | Model accuracy has not degraded beyond baseline threshold. Factor_a = baseline (x1000), factor_b = current. | AUTO |
| AI-FAIR.1 | Bias / Demographic Parity | Disparate impact ratio remains within acceptable bounds across protected groups. | AUTO |
| AI-FAIR.2 | Fairness Threshold | Composite fairness score meets or exceeds minimum calibration level. | AUTO |
| AI-EXPL.2 | Confidence Score | Model outputs include confidence above minimum actionable threshold. | AUTO |
| Appropriateness Justification | All of the above | Why each metric is appropriate for this specific use case, risk tier, and affected population. | MANUAL |
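The factor_a / factor_b pairs in the table above can be read as a simple threshold comparison. The following is a minimal sketch of that reading, not the SWT3 SDK's actual logic: the `Observation` class, the `passes` function, the direction sets, and the `scale_accuracy` helper are all hypothetical names, and the interpretation of "x1000" as accuracy multiplied by 1000 and stored as an integer is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One factor pair for a procedure (hypothetical structure)."""
    procedure: str  # e.g. "AI-INF.2"
    factor_a: int   # threshold, limit, or baseline
    factor_b: int   # observed value


# Assumption: latency and throughput are upper bounds (observed must not
# exceed factor_a); accuracy is a lower bound (observed must not fall below).
UPPER_BOUND = {"AI-INF.2", "AI-INF.3"}
LOWER_BOUND = {"AI-MDL.3"}


def scale_accuracy(accuracy: float) -> int:
    """Assumed meaning of 'x1000': 0.95 is recorded as 950."""
    return round(accuracy * 1000)


def passes(obs: Observation) -> bool:
    """Return True when the observed value respects the bound."""
    if obs.procedure in UPPER_BOUND:
        return obs.factor_b <= obs.factor_a
    if obs.procedure in LOWER_BOUND:
        return obs.factor_b >= obs.factor_a
    raise ValueError(f"no comparison rule defined for {obs.procedure}")
```

For example, `passes(Observation("AI-INF.2", 200, 143))` treats a 143 ms observation against a 200 ms threshold as a pass, while an accuracy of 0.93 against a 0.95 baseline would be recorded as 930 versus 950 and fail.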
Complete one row per metric. This table becomes part of the Article 11 technical documentation package and is reviewed during conformity assessment (Article 43).
| Field | Description | Example (Fraud Detection) |
|---|---|---|
| Metric Name | The specific metric being measured | Inference Latency (P95) |
| SWT3 Procedure | The SWT3 procedure that captures this metric | AI-INF.2 |
| Purpose | Why this metric matters for the use case | Real-time fraud scoring requires sub-200ms decisions to avoid blocking legitimate transactions |
| Threshold / Target | The specific value that defines pass/fail | P95 latency under 200ms per inference |
| Validation Method | How the threshold is verified | Continuous monitoring via SWT3 AI-INF.2 with factor_a=200 (threshold), factor_b=actual latency |
| Benchmark Applied | External standard or internal baseline used for comparison | PCI DSS 4.0 Section 5.2 (real-time fraud detection response targets) |
| Appropriateness Justification | Why this metric is suitable for the risk level and affected population | Payment fraud affects consumers financially. Sub-200ms latency ensures legitimate transactions are not delayed while maintaining detection coverage above 99.5%. This threshold is derived from card network SLA requirements and confirmed through 6 months of production monitoring. |
| Known Limitations | Any scenarios where this metric does not fully capture risk | Latency metric does not capture quality of fraud decisions. Accuracy is tracked separately via AI-MDL.3 and AI-FAIR.1. |
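Because every field in the table above must be completed for each metric, a deployer may want to check rows for gaps before submission. The sketch below is one way to do that, assuming the rows are kept as plain records; `MetricEntry` and `missing_fields` are hypothetical names, not part of any SWT3 or Axiom API.

```python
from dataclasses import dataclass, fields


@dataclass
class MetricEntry:
    """One row of the Section 4 metric table (hypothetical record type)."""
    metric_name: str
    swt3_procedure: str
    purpose: str
    threshold: str
    validation_method: str
    benchmark_applied: str
    appropriateness_justification: str
    known_limitations: str


def missing_fields(entry: MetricEntry) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [
        f.name for f in fields(entry)
        if not getattr(entry, f.name).strip()
    ]


# Example row with the MANUAL justification still to be written.
draft = MetricEntry(
    metric_name="Inference Latency (P95)",
    swt3_procedure="AI-INF.2",
    purpose="Real-time fraud scoring requires sub-200ms decisions",
    threshold="P95 latency under 200ms per inference",
    validation_method="Continuous monitoring via SWT3 AI-INF.2",
    benchmark_applied="Internal baseline",
    appropriateness_justification="",
    known_limitations="Does not capture quality of fraud decisions",
)
```

Calling `missing_fields(draft)` reports the empty justification field, which is the part of the row that the SDK cannot fill in automatically.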
The following demonstrates a complete Section 4 submission for a credit card fraud detection system classified as high-risk under Annex III, Section 5(b).
| Metric | Procedure | Threshold | Appropriateness Justification |
|---|---|---|---|
| Inference Latency | AI-INF.2 | P95 < 200ms | Real-time authorization requires sub-200ms response. Delays beyond this cause declined transactions. Threshold derived from Visa/Mastercard network SLAs. |
| Request Volume | AI-INF.3 | < 10,000 req/min | Capacity governance prevents resource exhaustion during peak shopping periods. Threshold set at 2x observed peak from Black Friday 2025 load test. |
| Model Drift | AI-MDL.3 | Accuracy > 95.0% | Fraud patterns evolve monthly. Accuracy below 95% indicates the model no longer captures emerging attack vectors. Threshold based on 12-month rolling performance analysis. |
| Demographic Parity | AI-FAIR.1 | Disparity ratio > 80% | The 80% threshold follows the Four-Fifths Rule (EEOC). Fraud scoring must not disproportionately flag transactions from specific demographic groups. Validated against CFPB fair lending guidance. |
| Confidence Score | AI-EXPL.2 | Min confidence 70% | Transactions below 70% confidence are routed to human review (AI-HITL.1) rather than auto-declined. This prevents automated false positives on ambiguous cases. |
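Two of the thresholds above reduce to short calculations. The sketch below illustrates the Four-Fifths Rule as the ratio of the lowest to the highest favorable-outcome rate across groups, and the 70% confidence cut-off as a routing decision. The function names, the choice of "transaction not flagged" as the favorable outcome, and the sample rates are assumptions made for illustration; how AI-FAIR.1 and AI-EXPL.2 compute these values internally is not described in this section.

```python
def disparate_impact_ratio(favorable_rates: dict[str, float]) -> float:
    """Lowest group rate divided by highest group rate (Four-Fifths Rule).

    favorable_rates maps a group label to the share of that group's
    transactions that received the favorable outcome (assumed here to
    mean 'not flagged as fraud').
    """
    if not favorable_rates:
        raise ValueError("at least one group is required")
    highest = max(favorable_rates.values())
    if highest <= 0:
        raise ValueError("highest favorable rate must be positive")
    return min(favorable_rates.values()) / highest


def meets_parity(favorable_rates: dict[str, float], minimum: float = 0.80) -> bool:
    """Apply the 'disparity ratio > 80%' threshold from the table."""
    return disparate_impact_ratio(favorable_rates) > minimum


def route(confidence: float, min_confidence: float = 0.70) -> str:
    """Send low-confidence decisions to human review instead of auto-declining."""
    return "automated" if confidence >= min_confidence else "human_review"


# Illustrative rates only, not measured data.
sample_rates = {"group_a": 0.98, "group_b": 0.95}
```

With the illustrative rates, the ratio is 0.95 / 0.98, which clears the 80% threshold; a transaction scored at 0.65 confidence would be routed to human review.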
When reviewing Section 4 documentation, Notified Body assessors should confirm:
For each metric claimed, the assessor can verify evidence exists in the SWT3 ledger
by requesting the AI Witness export artifact (`GET /api/v1/ai-witness/export`).
The `regulatory_coverage.procedures` array shows which procedures have been
observed, with anchor counts and pass rates per procedure.
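The check described above can be sketched as a comparison between the procedures claimed in Section 4 and those present in the export. The export schema is not given in this section, so the key names `procedure`, `anchor_count`, and `pass_rate` are assumptions, as are the sample values; only the `regulatory_coverage.procedures` path comes from the text. The sketch works on an already-downloaded JSON document rather than calling the endpoint, since the host and authentication scheme are not specified here.

```python
import json


def coverage_by_procedure(export: dict) -> dict[str, dict]:
    """Index the regulatory_coverage.procedures array by procedure ID.

    Assumes each array item carries a 'procedure' key (hypothetical name).
    """
    procedures = export.get("regulatory_coverage", {}).get("procedures", [])
    return {item["procedure"]: item for item in procedures}


def unsupported_claims(export: dict, claimed: list[str]) -> list[str]:
    """Return claimed procedures that have no anchors in the export."""
    coverage = coverage_by_procedure(export)
    return [
        proc for proc in claimed
        if coverage.get(proc, {}).get("anchor_count", 0) == 0
    ]


# Made-up export fragment for illustration; field names are assumed.
SAMPLE_EXPORT = json.loads("""
{
  "regulatory_coverage": {
    "procedures": [
      {"procedure": "AI-INF.2", "anchor_count": 1200, "pass_rate": 0.998},
      {"procedure": "AI-MDL.3", "anchor_count": 30, "pass_rate": 1.0}
    ]
  }
}
""")
```

An assessor comparing the claims `["AI-INF.2", "AI-MDL.3", "AI-FAIR.1"]` against this fragment would find that AI-FAIR.1 has no recorded evidence and should be queried with the provider.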