1. Regulatory Requirement

Annex IV, Section 4 requires high-risk AI system providers to document: "the metrics used to measure accuracy, robustness and cybersecurity [...] as well as any known limitations of the high-risk AI system [...], the specification of input data and any relevant information in terms of the training and validation datasets used, taking into account the intended purpose of the AI system."

The deployer must justify why the chosen performance metrics are appropriate for the specific use case, risk level, and affected population.

This section goes beyond listing metrics. Assessors (Notified Bodies under Article 43) evaluate whether the metrics are fit for purpose, not just whether they exist. A latency SLA is appropriate for real-time fraud detection but irrelevant for a batch classification system. Demographic parity is appropriate for hiring but may be inappropriate for medical triage.

2. SWT3 Evidence Mapping

Axiom auto-populates Annex IV Section 4 evidence from the AI procedures listed below. A Source of AUTO means the SWT3 SDK generates the evidence without additional deployer action; MANUAL means the deployer must supply it.

| SWT3 Procedure | Metric Domain | What It Proves | Source |
|---|---|---|---|
| AI-INF.2 | Latency / Response Time | Model responds within defined SLA threshold. Factor_a = threshold (ms), factor_b = actual (ms). | AUTO |
| AI-INF.3 | Throughput / Volume | System stays within capacity governance limits. Factor_a = limit (req/min), factor_b = actual. | AUTO |
| AI-MDL.3 | Accuracy / Drift | Model accuracy has not degraded beyond baseline threshold. Factor_a = baseline (x1000), factor_b = current. | AUTO |
| AI-FAIR.1 | Bias / Demographic Parity | Disparate impact ratio remains within acceptable bounds across protected groups. | AUTO |
| AI-FAIR.2 | Fairness Threshold | Composite fairness score meets or exceeds minimum calibration level. | AUTO |
| AI-EXPL.2 | Confidence Score | Model outputs include confidence above minimum actionable threshold. | AUTO |
| Appropriateness Justification | All of the above | Why each metric is appropriate for this specific use case, risk tier, and affected population. | MANUAL |

Auto-population: When AI-INF.2 or AI-INF.3 anchors are present in the SWT3 ledger, Axiom advances the Annex IV Section 4 checklist from NOT_STARTED to PARTIAL automatically. Full completion requires the deployer to submit the appropriateness justification below.
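
The auto-population rule can be sketched as a small status function. This is an illustration only, not Axiom's implementation: the function name `section4_status`, the `COMPLETE` label, and the representation of ledger anchors as a set of procedure IDs are all assumptions. Only NOT_STARTED, PARTIAL, and the AI-INF.2 / AI-INF.3 trigger come from the text above.

```python
# Procedures whose anchors advance the checklist, per the rule above.
TRIGGER_PROCEDURES = {"AI-INF.2", "AI-INF.3"}


def section4_status(anchored_procedures, justification_submitted):
    """Return a checklist status for Annex IV Section 4 (hypothetical helper).

    anchored_procedures: procedure IDs that have anchors in the SWT3 ledger
        (assumed representation; the real ledger format is not shown here).
    justification_submitted: True once the MANUAL appropriateness
        justification has been filed.
    """
    has_trigger = bool(TRIGGER_PROCEDURES & set(anchored_procedures))
    if not has_trigger:
        # Assumption: without a trigger anchor the checklist does not advance,
        # even if a justification has already been written.
        return "NOT_STARTED"
    if justification_submitted:
        return "COMPLETE"  # assumed label; the source names only the first two
    return "PARTIAL"
```

For example, `section4_status({"AI-INF.2"}, False)` yields `"PARTIAL"`, which mirrors the point that automated evidence alone never completes the section.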

3. Justification Template

Complete one copy of this template per metric. The completed templates become part of the Article 11 technical documentation package and are reviewed during conformity assessment (Article 43).

| Field | Description | Example (Fraud Detection) |
|---|---|---|
| Metric Name | The specific metric being measured | Inference Latency (P95) |
| SWT3 Procedure | The SWT3 procedure that captures this metric | AI-INF.2 |
| Purpose | Why this metric matters for the use case | Real-time fraud scoring requires sub-200ms decisions to avoid blocking legitimate transactions |
| Threshold / Target | The specific value that defines pass/fail | P95 latency under 200ms per inference |
| Validation Method | How the threshold is verified | Continuous monitoring via SWT3 AI-INF.2 with factor_a=200 (threshold), factor_b=actual latency |
| Benchmark Applied | External standard or internal baseline used for comparison | PCI DSS 4.0 Section 5.2 (real-time fraud detection response targets) |
| Appropriateness Justification | Why this metric is suitable for the risk level and affected population | Payment fraud affects consumers financially. Sub-200ms latency ensures legitimate transactions are not delayed while maintaining detection coverage above 99.5%. This threshold is derived from card network SLA requirements and confirmed through 6 months of production monitoring. |
| Known Limitations | Any scenarios where this metric does not fully capture risk | Latency metric does not capture quality of fraud decisions. Accuracy is tracked separately via AI-MDL.3 and AI-FAIR.1. |
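
One way to keep these justifications reviewable is to hold each one as a structured record and check it for blank fields before submission. The sketch below is a hypothetical aid, not part of the SWT3 SDK: the class name `MetricJustification`, its attribute names, and the `missing_fields` helper are assumptions that simply mirror the eight template fields.

```python
from dataclasses import dataclass, fields


@dataclass
class MetricJustification:
    """One completed justification template (hypothetical structure)."""

    metric_name: str
    swt3_procedure: str
    purpose: str
    threshold: str
    validation_method: str
    benchmark_applied: str
    appropriateness_justification: str
    known_limitations: str

    def missing_fields(self):
        """Return the names of fields left empty or whitespace-only."""
        return [
            f.name for f in fields(self)
            if not getattr(self, f.name).strip()
        ]


# Example using values from the template; the benchmark is left blank
# on purpose to show how an incomplete record is reported.
latency = MetricJustification(
    metric_name="Inference Latency (P95)",
    swt3_procedure="AI-INF.2",
    purpose="Real-time fraud scoring requires sub-200ms decisions",
    threshold="P95 latency under 200ms per inference",
    validation_method="Continuous monitoring via SWT3 AI-INF.2",
    benchmark_applied="",
    appropriateness_justification="Derived from card network SLA requirements",
    known_limitations="Does not capture quality of fraud decisions",
)
```

Calling `latency.missing_fields()` on this record reports `benchmark_applied`, so the gap is caught before the record reaches an assessor.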

4. Filled Example: Fraud Detection Model

The following demonstrates a complete Section 4 submission for a credit card fraud detection system classified as high-risk under Annex III, Section 5(b).

| Metric | Procedure | Threshold | Appropriateness Justification |
|---|---|---|---|
| Inference Latency | AI-INF.2 | P95 < 200ms | Real-time authorization requires sub-200ms response. Delays beyond this cause declined transactions. Threshold derived from Visa/Mastercard network SLAs. |
| Request Volume | AI-INF.3 | < 10,000 req/min | Capacity governance prevents resource exhaustion during peak shopping periods. Threshold set at 2x observed peak from Black Friday 2025 load test. |
| Model Drift | AI-MDL.3 | Accuracy > 95.0% | Fraud patterns evolve monthly. Accuracy below 95% indicates the model no longer captures emerging attack vectors. Threshold based on 12-month rolling performance analysis. |
| Demographic Parity | AI-FAIR.1 | Disparity ratio > 80% | The 80% threshold follows the Four-Fifths Rule (EEOC). Fraud scoring must not disproportionately flag transactions from specific demographic groups. Validated against CFPB fair lending guidance. |
| Confidence Score | AI-EXPL.2 | Min confidence 70% | Transactions below 70% confidence are routed to human review (AI-HITL.1) rather than auto-declined. This prevents automated false positives on ambiguous cases. |
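
The thresholds in this example can be expressed as simple checks, which makes the pass/fail boundaries explicit. The following is a minimal sketch under stated assumptions: the function names are hypothetical, P95 uses the nearest-rank method, percentages are held as fractions, and the disparity ratio is taken as the lowest group flag rate divided by the highest. The actual SWT3 computations may differ.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def disparity_ratio(flag_rates):
    """Lowest group flag rate divided by the highest (assumed definition)."""
    highest = max(flag_rates.values())
    if highest == 0:
        return 1.0  # no group is flagged at all, so there is no disparity
    return min(flag_rates.values()) / highest


def route(confidence, min_confidence=0.70):
    """Send scores below the minimum confidence to human review (AI-HITL.1)."""
    return "auto" if confidence >= min_confidence else "human_review"


def evaluate(latencies_ms, req_per_min, accuracy, flag_rates):
    """Apply the example thresholds and return pass/fail per procedure."""
    return {
        "AI-INF.2": p95(latencies_ms) < 200,
        "AI-INF.3": req_per_min < 10_000,
        "AI-MDL.3": accuracy > 0.95,
        "AI-FAIR.1": disparity_ratio(flag_rates) > 0.80,
    }
```

Writing the checks this way also exposes choices the justification should state, such as which percentile method is used and which rate the disparity ratio compares.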

5. Assessor Guidance

What to verify

When reviewing Section 4 documentation, Notified Body assessors should confirm:

SWT3 evidence verification

For each metric claimed, the assessor can verify evidence exists in the SWT3 ledger by requesting the AI Witness export artifact (GET /api/v1/ai-witness/export). The regulatory_coverage.procedures array shows which procedures have been observed, with anchor counts and pass rates per procedure.
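
An assessor-side check against the export might look like the sketch below. Only the endpoint path and the `regulatory_coverage.procedures` array come from the text above. The per-entry keys `procedure` and `anchor_count`, the JSON response format, the helper names, and the way the base URL and any authentication headers are supplied are all assumptions to be replaced with the real export schema.

```python
import json
import urllib.request


def fetch_export(base_url, headers=None):
    """GET the AI Witness export and decode it as JSON (assumed format).

    base_url and any auth headers are deployment-specific and supplied
    by the caller; nothing here is taken from the SWT3 documentation.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/ai-witness/export",
        headers=headers or {},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def unverified_procedures(export, claimed):
    """Return claimed procedure IDs that have no observed anchors."""
    observed = {
        entry["procedure"]  # assumed key name
        for entry in export["regulatory_coverage"]["procedures"]
        if entry.get("anchor_count", 0) > 0  # assumed key name
    }
    return sorted(set(claimed) - observed)


# Hand-written sample in the assumed shape, for illustration only.
sample_export = {
    "regulatory_coverage": {
        "procedures": [
            {"procedure": "AI-INF.2", "anchor_count": 10},
            {"procedure": "AI-MDL.3", "anchor_count": 0},
        ]
    }
}
```

With this sample, a submission claiming AI-INF.2, AI-MDL.3, and AI-FAIR.1 would have two procedures flagged as lacking ledger evidence: AI-MDL.3 (listed but with no anchors) and AI-FAIR.1 (absent from the export).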

Common deficiencies