Annex IV Section 4: Performance Metrics Appropriateness

1. Regulatory Requirement

Annex IV, Section 4 requires high-risk AI system providers to document: "the metrics used to measure accuracy, robustness and cybersecurity [...] as well as any known limitations of the high-risk AI system [...], the specification of input data and any relevant information in terms of the training and validation datasets used, taking into account the intended purpose of the AI system."

The provider must justify why the chosen performance metrics are appropriate for the specific use case, risk level, and affected population.

This section goes beyond listing metrics. Assessors (Notified Bodies under Article 43) evaluate whether the metrics are fit for purpose -- not just whether they exist. A latency SLA is appropriate for real-time fraud detection but irrelevant for a batch classification system. Demographic parity is appropriate for hiring but may be inappropriate for medical triage.

2. SWT3 Evidence Mapping

Axiom auto-populates Annex IV Section 4 evidence from the AI procedures listed below. AUTO means the SWT3 SDK generates the evidence without additional deployer action; MANUAL means the deployer must supply it.

SWT3 Procedure	Metric Domain	What It Proves	Source
`AI-INF.2`	Latency / Response Time	Model responds within defined SLA threshold. factor_a = threshold (ms), factor_b = actual (ms).	AUTO
`AI-INF.3`	Throughput / Volume	System stays within capacity governance limits. factor_a = limit (req/min), factor_b = actual (req/min).	AUTO
`AI-MDL.3`	Accuracy / Drift	Model accuracy has not degraded beyond baseline threshold. factor_a = baseline (x1000), factor_b = current.	AUTO
`AI-FAIR.1`	Bias / Demographic Parity	Disparate impact ratio remains within acceptable bounds across protected groups.	AUTO
`AI-FAIR.2`	Fairness Threshold	Composite fairness score meets or exceeds minimum calibration level.	AUTO
`AI-EXPL.2`	Confidence Score	Model outputs include confidence above minimum actionable threshold.	AUTO
Appropriateness Justification	All of the above	Why each metric is appropriate for this specific use case, risk tier, and affected population.	MANUAL
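
The factor_a / factor_b pairs in the table above can be read as a reference value and an observed value. The sketch below shows one way such a pair could be turned into a pass/fail result. It is illustrative only: the `Anchor` class, the direction of each comparison, and the reading of "x1000" as "accuracy multiplied by 1000" (applied to both factors) are assumptions inferred from the table, not the SWT3 SDK's actual interface.

```python
from dataclasses import dataclass

# Assumption: the passing direction for each procedure is inferred from the
# evidence-mapping table, not taken from the SWT3 SDK.
LOWER_IS_BETTER = {"AI-INF.2", "AI-INF.3"}  # observed value must not exceed the threshold/limit
HIGHER_IS_BETTER = {"AI-MDL.3"}             # observed value must not fall below the baseline


@dataclass(frozen=True)
class Anchor:
    """Hypothetical stand-in for a single ledger entry."""
    procedure: str
    factor_a: int  # threshold, limit, or baseline
    factor_b: int  # observed value


def passes(anchor: Anchor) -> bool:
    """Compare the observed value against the reference value."""
    if anchor.procedure in LOWER_IS_BETTER:
        return anchor.factor_b <= anchor.factor_a
    if anchor.procedure in HIGHER_IS_BETTER:
        return anchor.factor_b >= anchor.factor_a
    raise ValueError(f"no comparison rule defined for {anchor.procedure}")


def scale_accuracy(accuracy: float) -> int:
    """Encode an accuracy in [0, 1] as an integer scaled by 1000 (assumed meaning of 'x1000')."""
    return round(accuracy * 1000)


latency_ok = passes(Anchor("AI-INF.2", factor_a=200, factor_b=180))
drift_ok = passes(Anchor("AI-MDL.3", scale_accuracy(0.95), scale_accuracy(0.962)))
```

Keeping the comparison direction in an explicit lookup makes it obvious when a procedure has no defined rule, rather than silently defaulting to one direction.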

Auto-population: When AI-INF.2 or AI-INF.3 anchors are present in the SWT3 ledger, Axiom advances the Annex IV Section 4 checklist from NOT_STARTED to PARTIAL automatically. Full completion requires the deployer to submit the appropriateness justification below.
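
The status transition described above can be sketched as a small function. This is a reading of the rule as stated, not Axiom's implementation: the `COMPLETE` status name, the function name, and the handling of a justification submitted with no anchors present are all assumptions.

```python
from enum import Enum


class ChecklistStatus(Enum):
    NOT_STARTED = "NOT_STARTED"
    PARTIAL = "PARTIAL"
    COMPLETE = "COMPLETE"  # hypothetical name; the text only says "full completion"


# Procedures whose anchors move the checklist out of NOT_STARTED (from the text).
TRIGGER_PROCEDURES = {"AI-INF.2", "AI-INF.3"}


def section4_status(observed_procedures: set, justification_submitted: bool) -> ChecklistStatus:
    """Derive the Section 4 checklist status from ledger contents and manual input."""
    has_trigger_anchor = bool(TRIGGER_PROCEDURES & set(observed_procedures))
    if not has_trigger_anchor:
        # Assumption: a justification without any latency or throughput
        # anchors does not advance the checklist on its own.
        return ChecklistStatus.NOT_STARTED
    if justification_submitted:
        return ChecklistStatus.COMPLETE
    return ChecklistStatus.PARTIAL


status = section4_status({"AI-INF.2", "AI-FAIR.1"}, justification_submitted=False)
```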

3. Justification Template

Complete the fields below once for each metric. The completed entries become part of the Article 11 technical documentation package and are reviewed during conformity assessment (Article 43).

Field	Description	Example (Fraud Detection)
Metric Name	The specific metric being measured	Inference Latency (P95)
SWT3 Procedure	The SWT3 procedure that captures this metric	AI-INF.2
Purpose	Why this metric matters for the use case	Real-time fraud scoring requires sub-200ms decisions to avoid blocking legitimate transactions
Threshold / Target	The specific value that defines pass/fail	P95 latency under 200ms per inference
Validation Method	How the threshold is verified	Continuous monitoring via SWT3 AI-INF.2 with factor_a=200 (threshold), factor_b=actual latency
Benchmark Applied	External standard or internal baseline used for comparison	Card network (Visa/Mastercard) authorization SLA requirements
Appropriateness Justification	Why this metric is suitable for the risk level and affected population	Payment fraud affects consumers financially. Sub-200ms latency ensures legitimate transactions are not delayed while maintaining detection coverage above 99.5%. This threshold is derived from card network SLA requirements and confirmed through 6 months of production monitoring.
Known Limitations	Any scenarios where this metric does not fully capture risk	Latency metric does not capture quality of fraud decisions. Accuracy is tracked separately via AI-MDL.3 and AI-FAIR.1.
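
For teams that keep justifications in code or configuration before submitting them, the template above maps naturally onto a record type with a completeness check. The sketch below is one possible shape. The class and function names are hypothetical, and nothing here reflects a format that Axiom requires.

```python
from dataclasses import dataclass, fields


@dataclass
class MetricJustification:
    """One completed template, mirroring the eight fields listed above."""
    metric_name: str
    swt3_procedure: str
    purpose: str
    threshold: str
    validation_method: str
    benchmark_applied: str
    appropriateness_justification: str
    known_limitations: str


def missing_fields(entry: MetricJustification) -> list:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


draft = MetricJustification(
    metric_name="Inference Latency (P95)",
    swt3_procedure="AI-INF.2",
    purpose="Real-time fraud scoring requires sub-200ms decisions",
    threshold="P95 latency under 200ms per inference",
    validation_method="Continuous monitoring via SWT3 AI-INF.2",
    benchmark_applied="",  # left blank to show the check
    appropriateness_justification="Sub-200ms latency avoids delaying legitimate transactions",
    known_limitations="   ",  # whitespace only, also reported as missing
)
gaps = missing_fields(draft)
```

A check like this only catches empty fields; whether the justification is persuasive remains a matter for human review.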

4. Filled Example: Fraud Detection Model

The following demonstrates a complete Section 4 submission for a credit card fraud detection system classified as high-risk under Annex III, Section 5(b).

Metric	Procedure	Threshold	Appropriateness Justification
Inference Latency	`AI-INF.2`	P95 < 200ms	Real-time authorization requires sub-200ms response. Delays beyond this cause declined transactions. Threshold derived from Visa/Mastercard network SLAs.
Request Volume	`AI-INF.3`	< 10,000 req/min	Capacity governance prevents resource exhaustion during peak shopping periods. Threshold set at 2x observed peak from Black Friday 2025 load test.
Model Drift	`AI-MDL.3`	Accuracy > 95.0%	Fraud patterns evolve monthly. Accuracy below 95% indicates the model no longer captures emerging attack vectors. Threshold based on 12-month rolling performance analysis.
Demographic Parity	`AI-FAIR.1`	Disparity ratio >= 80%	The 80% threshold follows the Four-Fifths Rule (EEOC), under which a ratio below four-fifths is treated as evidence of adverse impact. Fraud scoring must not disproportionately flag transactions from specific demographic groups. Validated against CFPB fair lending guidance.
Confidence Score	`AI-EXPL.2`	Min confidence 70%	Transactions below 70% confidence are routed to human review (AI-HITL.1) rather than auto-declined. This prevents automated false positives on ambiguous cases.
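
Two of the thresholds in the table above involve a small calculation, sketched below under stated assumptions. The disparity ratio is computed as the lowest group rate divided by the highest, where the rate is taken to be the share of each group's transactions that are not flagged. That choice of "favorable outcome", the group labels, and the function names are illustrative and do not come from the SWT3 procedures themselves.

```python
def disparate_impact_ratio(favorable_rates: dict) -> float:
    """Lowest group rate divided by highest group rate (four-fifths style ratio)."""
    if not favorable_rates:
        raise ValueError("at least one group rate is required")
    highest = max(favorable_rates.values())
    if highest == 0:
        raise ValueError("ratio is undefined when every group rate is zero")
    return min(favorable_rates.values()) / highest


def meets_four_fifths(favorable_rates: dict, minimum: float = 0.8) -> bool:
    """A ratio below `minimum` is treated as a potential adverse impact."""
    return disparate_impact_ratio(favorable_rates) >= minimum


def route(confidence: float, minimum: float = 0.70) -> str:
    """Send low-confidence decisions to human review instead of automating them."""
    return "automated" if confidence >= minimum else "human_review"


# Hypothetical per-group shares of transactions that were NOT flagged.
rates = {"group_a": 0.98, "group_b": 0.96}
ratio = disparate_impact_ratio(rates)
parity_ok = meets_four_fifths(rates)
decision = route(0.65)
```

Demographic parity as sketched here compares outcome rates only. It says nothing about whether error rates differ between groups, which is one reason the template asks for known limitations alongside each metric.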

5. Assessor Guidance

What to verify

When reviewing Section 4 documentation, Notified Body assessors should confirm:

Each metric has a clear link to the intended purpose of the AI system
Thresholds are derived from standards, benchmarks, or empirical analysis -- not arbitrary
Known limitations are acknowledged and mitigated (e.g., by complementary metrics)
The SWT3 procedure cited actually produces evidence for the claimed metric
Continuous monitoring is in place (not just point-in-time testing)

SWT3 evidence verification

For each metric claimed, the assessor can verify evidence exists in the SWT3 ledger by requesting the AI Witness export artifact (`GET /api/v1/ai-witness/export`). The `regulatory_coverage.procedures` array shows which procedures have been observed, with anchor counts and pass rates per procedure.
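
Once the export has been retrieved and parsed, cross-checking it against the procedures cited in the documentation is a simple lookup. The sketch below assumes each entry in `regulatory_coverage.procedures` carries `procedure`, `anchor_count`, and `pass_rate` keys. Those key names are guesses based on the description above, so they should be checked against a real export before relying on them.

```python
def check_claimed_procedures(export: dict, claimed: list, min_pass_rate: float = 0.0) -> dict:
    """Map each claimed procedure to 'ok', 'no evidence', or 'below pass rate'."""
    entries = export.get("regulatory_coverage", {}).get("procedures", [])
    # Key names below are assumptions about the export schema.
    observed = {entry["procedure"]: entry for entry in entries}

    findings = {}
    for procedure in claimed:
        entry = observed.get(procedure)
        if entry is None or entry.get("anchor_count", 0) == 0:
            findings[procedure] = "no evidence"
        elif entry.get("pass_rate", 0.0) < min_pass_rate:
            findings[procedure] = "below pass rate"
        else:
            findings[procedure] = "ok"
    return findings


# Hand-written sample standing in for a parsed export; not real ledger data.
sample_export = {
    "regulatory_coverage": {
        "procedures": [
            {"procedure": "AI-INF.2", "anchor_count": 1200, "pass_rate": 0.998},
            {"procedure": "AI-MDL.3", "anchor_count": 30, "pass_rate": 0.80},
        ]
    }
}
findings = check_claimed_procedures(
    sample_export, ["AI-INF.2", "AI-MDL.3", "AI-FAIR.1"], min_pass_rate=0.95
)
```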

Common deficiencies

Metrics without justification: Accuracy and latency are listed with no explanation of why those specific metrics (and thresholds) are appropriate for the risk level.
Missing known limitations: Every metric has blind spots, and failing to document them suggests an incomplete risk analysis.
No benchmark reference: Thresholds do not trace to industry standards, regulatory guidance, or documented empirical analysis.
Static-only validation: One-time test reports are insufficient. Continuous monitoring evidence (SWT3 anchors over time) is needed to demonstrate ongoing compliance.