Annex 22 and the Rise of AI Governance in GMP

Author

Omer Cimen

CEO & Co-Founder

Artificial intelligence is moving from experimentation into regulated operations. That shift is exciting, but it also changes the compliance conversation. In GMP environments, AI can no longer sit in the innovation corner with a clever hat and no audit trail.

The European Commission’s draft Annex 22 on Artificial Intelligence signals that AI governance is becoming a formal part of GMP expectations. The new annex was introduced alongside proposed revisions to Chapter 4 on Documentation and Annex 11 on Computerised Systems, with the stated goal of supporting innovation in medicines manufacturing while preserving regulatory harmonization, product quality, and patient safety. (Public Health)

This is an important moment for life sciences organizations. Annex 22 is not only about whether companies may use AI. It is about how AI should be selected, trained, validated, tested, monitored, documented, and controlled when used in GMP-relevant contexts.

In other words, AI governance is becoming validation governance.

Why Annex 22 Matters

Annex 22 matters because it gives AI a specific place inside the GMP conversation.

Until recently, many organizations treated AI governance as a general technology issue. The same phrases appeared again and again: responsible AI, human oversight, transparency, risk management. These ideas are useful, but in GMP they need operational teeth. A quality team does not only need to say that AI is supervised. It needs to show intended use, data controls, acceptance criteria, testing evidence, performance monitoring, change control, and review records.

That is the shift Annex 22 represents.

The European Commission’s consultation page states that the new annex establishes requirements for AI and machine learning in manufacturing of active substances and medicinal products. It specifically highlights model selection, training, validation, intended use, performance metrics, training data quality, test data management, continuous oversight, change control, model performance monitoring, and human review procedures when needed. (Public Health)

This is no longer a soft governance discussion. It is a structured control discussion.

Annex 22 Is Moving in Parallel With Annex 11

Annex 22 should not be read alone. It is part of a broader update to GMP expectations around documentation, computerized systems, data integrity, and digital operations.

The EMA Inspectors Working Group work plan lists Annex 22 Artificial Intelligence with a Q4 2026 target date, stating that the final text is intended to assure the use of artificial intelligence in the context of GMP. It also says this work runs in parallel with updates to Annex 11 and Chapter 4. (European Medicines Agency (EMA))

That connection matters.

Annex 11 focuses on computerized systems. Chapter 4 focuses on documentation. Annex 22 focuses on artificial intelligence. Together, they point toward a more connected regulatory model where digital records, system validation, data integrity, AI governance, and lifecycle control are treated as interdependent.

For validation teams, this means AI governance cannot live in a separate policy document that no one uses. It needs to be embedded into computerized system validation, change control, test execution, supplier oversight, documentation governance, and periodic review.

This is also why concepts like AI-Native Validation Infrastructure, or ANVI, are becoming more relevant. ANVI describes the kind of connected validation foundation needed when AI, evidence, traceability, risk, and lifecycle control must operate together rather than as disconnected islands.

The Scope of Annex 22 Is Narrow, but the Signal Is Broad

One of the most important details in the draft Annex 22 is its scope.

The draft applies to computerized systems used in manufacturing of medicinal products and active substances where AI models are used in critical applications with direct impact on patient safety, product quality, or data integrity. It also states that the document provides additional guidance to Annex 11 for computerized systems in which AI models are embedded.

The draft is careful about what types of AI it covers. It applies to machine learning models that gained their functionality through training with data rather than explicit programming. It applies to static models that do not adapt performance during use by incorporating new data. It also applies to deterministic models that provide identical outputs when given identical inputs.
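The deterministic property described above (identical outputs for identical inputs) can be spot-checked with a repeat-run comparison. The following is a minimal sketch, not a method taken from the draft: `is_deterministic`, `static_model`, and `sampling_model` are hypothetical names invented for illustration.

```python
import random
from typing import Any, Callable, Sequence


def is_deterministic(predict: Callable[[Any], Any],
                     inputs: Sequence[Any],
                     runs: int = 3) -> bool:
    """Return True if every repeated run yields the same outputs for the same inputs."""
    reference = [predict(x) for x in inputs]
    return all([predict(x) for x in inputs] == reference for _ in range(runs - 1))


def static_model(x: float) -> str:
    # Hypothetical static, deterministic model: a fixed decision boundary.
    return "reject" if x > 0.5 else "accept"


def sampling_model(x: float) -> str:
    # Hypothetical probabilistic model: the output is sampled, so it can vary.
    return "reject" if random.random() < x else "accept"


inputs = [0.5] * 50
print(is_deterministic(static_model, inputs))    # True
print(is_deterministic(sampling_model, inputs))  # almost certainly False
```

A passing check only means no variation was observed across the repeated runs. It does not prove determinism, so it complements a documented understanding of the model architecture and runtime environment rather than replacing it.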

The draft excludes from critical GMP applications any dynamic models that continuously and automatically learn during use. It likewise excludes probabilistic models from critical GMP applications, and states that generative AI and large language models should not be used in critical GMP applications.

This is a very clear boundary.

The draft is not saying that every AI model belongs in every GMP process. It is saying that critical GMP use requires control, predictability, testability, and evidence. For non-critical GMP applications involving generative AI or LLMs, the draft says qualified and trained personnel should remain responsible for ensuring outputs are suitable for intended use, with human-in-the-loop principles considered where applicable.

The signal is broader than the scope. Regulators are not treating AI as magic. They are treating it as a system component that must be understood, constrained, tested, and governed.

Intended Use Becomes the Anchor

The draft Annex 22 places major emphasis on intended use.

That is exactly where AI governance should begin. A model cannot be meaningfully validated in the abstract. It can only be evaluated against the task it is supposed to perform, the data it will receive, the environment in which it will operate, and the risk attached to its output.

The draft says the intended use of a model and the specific tasks it is designed to assist or automate should be described in detail, based on in-depth knowledge of the process in which the model is integrated. It also says this should include characterization of the data used as input, common and rare variations, and limitations or possible erroneous and biased inputs.

That is a practical requirement with large implications.

A vague statement like “AI supports quality decisions” is not enough. Teams need to describe what the model does, what it does not do, where it is used, what data it sees, what edge cases exist, what risks matter, and which human or system actions depend on its output.

This is where many AI programs quietly wobble. The model may be impressive, but the intended use is poorly defined. In GMP, that is a serious weakness. Without intended use, there is no meaningful acceptance criterion. Without acceptance criteria, there is no defensible testing. Without defensible testing, the model is a small black box wearing a lab coat.

AI Governance Requires Multidisciplinary Ownership

Annex 22 also makes it clear that AI governance is not an IT-only responsibility.

The draft says there should be close cooperation between relevant parties during algorithm selection, model training, validation, testing, and operation. It lists process subject matter experts, QA, data scientists, IT, and consultants as examples, and says personnel should have adequate qualifications, defined responsibilities, and appropriate access levels.

This matters because AI risk is not only technical. It is process risk, data risk, quality risk, patient risk, and validation risk.

A data scientist may understand model behavior but not GMP process consequences. A process owner may understand manufacturing reality but not model drift. QA may understand approval and compliance expectations but need technical support to evaluate model performance evidence. Validation teams sit in the middle of this little orchestra, trying to make sure the trumpet does not validate the violin.

The operational answer is shared ownership. AI governance should define who owns intended use, who owns data quality, who owns acceptance criteria, who reviews test evidence, who approves deployment, who monitors performance, and who decides when retraining or retesting is required.

The FDA and EMA’s 2026 guiding principles for Good AI Practice in drug development point in the same direction. They emphasize human-centric design, risk-based approaches, clear context of use, multidisciplinary expertise, data governance and documentation, risk-based performance assessment, lifecycle management, and clear essential information. (U.S. Food and Drug Administration)

Test Data Is No Longer a Technical Detail

One of the strongest themes in Annex 22 is test data governance.

The draft says test data should be representative of, and span, the full sample space of intended use. It should be stratified, include subgroups, and reflect limitations, complexity, and common and rare variations. It also says the rationale for test data selection should be documented.

This is important because AI performance depends heavily on the data used to test it. A model can appear strong if tested only on convenient examples. It can collapse when exposed to edge cases, subgroup differences, rare defects, equipment variation, site variation, lighting changes, material differences, or messy real-world inputs.

The draft also says test datasets and subgroups should be large enough to calculate metrics with adequate statistical confidence, and that labeling should be verified through a process that ensures a very high degree of correctness.

That means test data must be governed like validation evidence, not treated as developer scrap material.
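One way to make "adequate statistical confidence" concrete is to compute a confidence interval for a metric in each subgroup. The sketch below uses the Wilson score interval for a proportion such as sensitivity. This is an illustrative choice: the draft does not prescribe a statistical method, and the counts are invented.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% confidence)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same observed sensitivity (96%), very different certainty.
small_subgroup = wilson_interval(48, 50)      # hypothetical rare-defect subgroup
large_subgroup = wilson_interval(960, 1000)   # hypothetical common-defect subgroup
print(small_subgroup)
print(large_subgroup)
```

With 50 samples the lower bound falls well below 90%, while with 1,000 samples it stays near 95%. A small subgroup can therefore fail to support an acceptance criterion even when its point estimate looks fine.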

The draft goes further by addressing test data independence. It says technical or procedural controls should ensure that data used to test a model is not used during development, training, or validation of the model. It also says test data should be protected by access control and audit trail functionality, with no copies outside the repository.

This is a major operational point. In GMP AI, test data is not just data. It is the evidence foundation for model acceptability.
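As an illustration of a technical control for test data independence, the sketch below fingerprints each record and reports any test record that also appears in the development data. The record layout and function names are assumptions made for this example. The check catches only exact duplicates, so it would sit alongside access control and procedural controls rather than replace them.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def record_fingerprint(record: Dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_leaked_records(development: Iterable[Dict], test: Iterable[Dict]) -> List[Dict]:
    """Return test records whose content also appears in the development data."""
    seen = {record_fingerprint(r) for r in development}
    return [r for r in test if record_fingerprint(r) in seen]


# Hypothetical records: image identifier plus verified label.
development_set = [
    {"image_id": "A-001", "label": "defect"},
    {"image_id": "A-002", "label": "ok"},
]
test_set = [
    {"label": "ok", "image_id": "A-002"},   # same content, different key order
    {"image_id": "B-100", "label": "defect"},
]

print(find_leaked_records(development_set, test_set))
# [{'label': 'ok', 'image_id': 'A-002'}]
```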

Acceptance Criteria Must Be Defined Before the Test

The draft Annex 22 expects suitable, case-dependent test metrics to measure model performance according to intended use. For classification models, it gives examples such as confusion matrix, sensitivity, specificity, accuracy, precision, and F1 score.

The draft also says acceptance criteria for those metrics should be established before acceptance testing and should define when the model is acceptable for intended use. It adds that acceptance criteria should be at least as high as the performance of the process the model replaces, which means the performance of the existing process should be known.

This is where AI governance gets beautifully inconvenient.

It is not enough to say “the model is good.” Good compared to what? Good for which subgroup? Good under which conditions? Good enough for what risk level? Better than the manual process? Equivalent? Safer? Faster but less accurate? More consistent but more conservative?

These questions need answers before testing begins.
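To show how predefined criteria and test metrics fit together, here is a small sketch that derives the classification metrics named in the draft from a confusion matrix and compares them with thresholds fixed before testing. The threshold values and counts are hypothetical. In practice the thresholds would come from the measured performance of the process being replaced.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    # Avoid division by zero when a class is absent from the test set.
    return numerator / denominator if denominator else 0.0


def classification_metrics(cm: ConfusionMatrix) -> Dict[str, float]:
    sensitivity = _ratio(cm.tp, cm.tp + cm.fn)
    specificity = _ratio(cm.tn, cm.tn + cm.fp)
    precision = _ratio(cm.tp, cm.tp + cm.fp)
    accuracy = _ratio(cm.tp + cm.tn, cm.tp + cm.fp + cm.tn + cm.fn)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}


def unmet_criteria(metrics: Dict[str, float], criteria: Dict[str, float]) -> Dict[str, float]:
    """Return each metric that falls below its predefined minimum."""
    return {name: metrics[name] for name, minimum in criteria.items()
            if metrics[name] < minimum}


# Hypothetical criteria, approved before acceptance testing starts.
ACCEPTANCE_CRITERIA = {"sensitivity": 0.97, "specificity": 0.99}

results = classification_metrics(ConfusionMatrix(tp=95, fp=4, tn=896, fn=5))
print(unmet_criteria(results, ACCEPTANCE_CRITERIA))  # {'sensitivity': 0.95}
```

In this example overall accuracy is above 99%, yet the model misses the sensitivity criterion. This is why the criteria need to be chosen per metric, against the intended use, before anyone sees the results.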

For life sciences organizations, this is a strong argument for more mature validation infrastructure. Acceptance criteria, test data, test execution, deviations, approvals, and operating controls need to remain connected. This is another place where ANVI becomes useful as a category lens. AI-Native Validation Infrastructure is not just about AI features. It is about creating the control fabric that keeps AI-related validation decisions traceable and defensible.

Explainability Becomes Part of Approval

Annex 22 does not treat explainability as a decoration.

The draft says that during testing of models used in critical GMP applications, systems should capture and record the features in test data that contributed to a particular classification or decision. It also says techniques such as SHAP, LIME, or visual heat maps should be used where applicable to highlight key factors contributing to the outcome.

The draft also says that reviewing these features should be part of the approval process for test results, based on risk, to ensure the model is making decisions using relevant and appropriate features.

This matters because a model can be statistically strong and still behave in a way that is unacceptable. It may classify correctly for the wrong reason. It may rely on artifacts, background features, equipment-specific quirks, or hidden correlations that do not represent the process being controlled.

In GMP, that is not a harmless curiosity. It can become a quality risk.

Explainability helps teams understand whether the model is learning the process or simply memorizing the furniture in the room.
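The idea of recording which features drove a decision can be illustrated with a deliberately simple occlusion-style attribution. It replaces one feature at a time with a baseline value and records how much the score changes. This is a simplified stand-in for illustration, not SHAP or LIME, and the scoring function and feature names are hypothetical.

```python
from typing import Callable, Dict


def occlusion_attribution(score: Callable[[Dict[str, float]], float],
                          sample: Dict[str, float],
                          baseline: Dict[str, float]) -> Dict[str, float]:
    """Score change when each feature is individually replaced by its baseline value."""
    full_score = score(sample)
    contributions = {}
    for name in sample:
        occluded = dict(sample)
        occluded[name] = baseline[name]
        contributions[name] = full_score - score(occluded)
    return contributions


def defect_score(features: Dict[str, float]) -> float:
    # Hypothetical model that leans on an irrelevant feature.
    return 0.1 * features["particle_area"] + 0.9 * features["background_brightness"]


sample = {"particle_area": 1.0, "background_brightness": 1.0}
baseline = {"particle_area": 0.0, "background_brightness": 0.0}

contributions = occlusion_attribution(defect_score, sample, baseline)
dominant = max(contributions, key=contributions.get)
print(dominant)  # background_brightness
```

A reviewer who sees a background feature dominating a defect decision has grounds to question the model even if its test metrics look strong.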

Confidence and “Undecided” Outputs Matter

Another useful detail in the draft is its treatment of confidence.

The draft says that when testing a model used to predict or classify data, the system should log the confidence score for each prediction or classification where applicable. It also says models should have appropriate threshold settings so predictions or classifications are made only when suitable. If confidence is very low, the draft says an “undecided” output should be considered instead of a potentially unreliable prediction or classification.

This is a mature way to think about AI in GMP.

In regulated environments, forcing an AI system to always decide can be dangerous. Sometimes the safest output is not “accept” or “reject.” Sometimes the safest output is “I do not know, escalate for review.”

That may feel less glamorous, but it is far more compatible with quality oversight. A model that knows when not to decide can be more useful than a model that confidently hurls answers like darts in a dark warehouse.
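A minimal sketch of threshold gating with a logged confidence score might look like the following. The threshold value, unit identifiers, and function names are assumptions made for illustration. An appropriate threshold would be set and justified as part of testing.

```python
from typing import Dict, List, Tuple

UNDECIDED = "undecided"


def gated_output(label: str, confidence: float, threshold: float) -> str:
    """Return the model's label only when confidence meets the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return label if confidence >= threshold else UNDECIDED


def run_with_log(predictions: List[Tuple[str, str, float]],
                 threshold: float) -> List[Dict[str, object]]:
    """Log the raw label, confidence, and gated output for every input."""
    return [
        {"input_id": input_id, "label": label, "confidence": confidence,
         "output": gated_output(label, confidence, threshold)}
        for input_id, label, confidence in predictions
    ]


# Hypothetical raw model predictions: (input id, label, confidence).
raw = [("unit-001", "accept", 0.99),
       ("unit-002", "reject", 0.62),
       ("unit-003", "reject", 0.97)]

log = run_with_log(raw, threshold=0.90)
print([entry["output"] for entry in log])  # ['accept', 'undecided', 'reject']
```

Keeping the raw label and confidence in the log alongside the gated output lets a reviewer see both what the model proposed and why the item was escalated.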

Operation Requires Change Control and Monitoring

Annex 22 also makes clear that AI governance does not end at deployment.

The draft says that a tested model, the system it is implemented in, and the whole process it automates or assists should be placed under change control before deployment. Any change to the model, the system, or the process in which it is used should be documented and evaluated to determine whether retesting is needed. Any decision not to retest should be fully justified.

The draft also says tested models should be under configuration control, with measures to detect unauthorized change. It calls for regular monitoring of model performance according to defined metrics, and regular monitoring to confirm whether input data still falls within the model’s sample space and intended use.

This is lifecycle management in plain clothes.

Models are not validated forever because they performed well once. Inputs drift. Equipment changes. Lighting changes. Materials change. Process patterns change. Users change. Data pipelines change. The validated state of an AI model must be maintained, not merely announced.

That is why AI governance belongs inside validation operations. Change control, configuration control, performance monitoring, input drift monitoring, deviation handling, and periodic review all need to connect.
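Two of these controls lend themselves to a short sketch. The first detects unauthorized change by comparing a model artifact's hash with the approved value. The second checks how much incoming data falls outside the range covered during testing. The byte strings, range limits, and function names are hypothetical. A real sample-space check would usually cover more than a single numeric range.

```python
import hashlib
from typing import Sequence


def artifact_fingerprint(artifact: bytes) -> str:
    """SHA-256 of the serialized model, recorded when the model is approved."""
    return hashlib.sha256(artifact).hexdigest()


def is_unchanged(artifact: bytes, approved_fingerprint: str) -> bool:
    return artifact_fingerprint(artifact) == approved_fingerprint


def fraction_outside_range(values: Sequence[float], low: float, high: float) -> float:
    """Share of inputs outside the range covered by the test data."""
    if not values:
        raise ValueError("no input values to check")
    outside = sum(1 for v in values if v < low or v > high)
    return outside / len(values)


# Hypothetical approved artifact and a later, modified copy.
approved = artifact_fingerprint(b"model-weights-v1")
print(is_unchanged(b"model-weights-v1", approved))           # True
print(is_unchanged(b"model-weights-v1-modified", approved))  # False

# Hypothetical input feature tested over the range 0.0 to 5.0.
print(fraction_outside_range([1.0, 2.0, 9.0, 3.0], low=0.0, high=5.0))  # 0.25
```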

Human-in-the-Loop Is Not a Shortcut

Annex 22 includes human review, but it does not treat human-in-the-loop as a magic compliance wand.

The draft says that when a model provides input to a decision made by a human operator, and testing effort has been reduced because of that human-in-the-loop approach, records should be kept. Depending on process criticality and the level of model testing, this may require consistent review or testing of every model output according to a procedure.

This aligns with a broader regulatory message: human oversight is necessary in many cases, but it does not replace validation evidence.

A person in the loop needs context, training, procedure, authority, and records. The organization needs to know what the human is reviewing, what criteria they are applying, how decisions are documented, and whether the review itself is effective.

The human is not a sticker placed over an AI risk. The human is part of a controlled process.

What Annex 22 Means for Validation Teams

Annex 22 should push validation teams to expand their thinking.

Traditional computerized system validation already covers intended use, requirements, risk, testing, traceability, change control, and evidence. AI adds new layers: model behavior, training data, test data independence, subgroups, performance metrics, explainability, confidence thresholds, drift monitoring, and model-specific change control.

This does not mean every validation team needs to become a machine learning research lab. It does mean validation teams need enough structure to ask the right questions and document the right evidence.

At minimum, teams should start preparing an AI governance model that covers:

  • inventory of AI-enabled systems and use cases
  • classification of critical and non-critical GMP use
  • intended use documentation
  • risk assessment tied to patient safety, product quality, and data integrity
  • data governance for training, validation, and test datasets
  • test plans with predefined metrics and acceptance criteria
  • documented SME involvement
  • explainability evidence where applicable
  • confidence thresholds and escalation logic
  • human review procedures
  • change control and configuration control
  • performance and drift monitoring
  • periodic review of AI-enabled processes

This is a lot. It is also exactly why manual spreadsheets and scattered documents will struggle.
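To illustrate what a connected inventory could look like compared with a spreadsheet row, the sketch below models one AI use case as a record whose governance items are references to controlled documents, and reports the items still missing. The field names and document identifiers are invented for this example and cover only part of the list above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AIUseCase:
    name: str
    critical: bool  # critical vs non-critical GMP use
    intended_use_doc: Optional[str] = None
    risk_assessment_doc: Optional[str] = None
    test_plan_doc: Optional[str] = None
    human_review_sop: Optional[str] = None
    change_control_ref: Optional[str] = None
    monitoring_plan_doc: Optional[str] = None

    def missing_controls(self) -> List[str]:
        """Names of governance items that have no document reference yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical inventory entry with placeholder document identifiers.
use_case = AIUseCase(
    name="Visual inspection classifier",
    critical=True,
    intended_use_doc="DOC-0001",
    risk_assessment_doc="DOC-0002",
    test_plan_doc="DOC-0003",
)

print(use_case.missing_controls())
# ['human_review_sop', 'change_control_ref', 'monitoring_plan_doc']
```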

Why ANVI Belongs in the Annex 22 Conversation

Annex 22 strengthens the case for AI-Native Validation Infrastructure.

ANVI is relevant because AI governance cannot be managed well through isolated documents alone. The controls Annex 22 points toward require connected relationships between systems, intended use, requirements, risks, test data, test scripts, acceptance criteria, execution evidence, deviations, approvals, change control, and monitoring.

That is infrastructure work.

A mature ANVI approach gives organizations a way to maintain these relationships continuously. It supports reviewability, traceability, lifecycle control, and defensibility across AI-assisted validation and AI-enabled GMP processes. It also helps prevent AI governance from becoming a loose bundle of policies, decks, and hope.

The future of GMP AI will not be defined by who has the most impressive model demo. It will be defined by who can defend how that model is used, tested, monitored, and controlled.

Conclusion

Annex 22 marks a major shift in the regulatory conversation around AI in GMP.

It signals that AI must be governed through intended use, risk management, data quality, test data independence, predefined metrics, acceptance criteria, explainability, confidence thresholds, change control, performance monitoring, and human review. It also connects AI governance directly to the broader updates around Annex 11 and Chapter 4, making clear that AI belongs inside the regulated digital control environment, not beside it. (Public Health)

For life sciences teams, the practical lesson is simple: start building AI governance before it becomes urgent.

The organizations that prepare now will be better positioned to use AI confidently, responsibly, and defensibly in GMP environments. They will also be better prepared for the broader future of digital validation, where AI governance, computerized system validation, data integrity, and lifecycle control are all part of the same operating fabric.

That fabric has a name that is starting to gain traction: AI-Native Validation Infrastructure.
