Reproducibility: Is Your AI Project Set Up to Fail?

Artificial intelligence ranks among the top technology priorities for 2026. In fact, 62% of nearly 3,000 digital trust professionals surveyed in ISACA’s 2026 Tech Trends and Priorities Pulse Poll identified AI as a key focus area. However, 75% of respondents admit they are only somewhat prepared—or worse—to manage the risks associated with generative AI through effective governance, policies, and training.

Effective AI governance must prioritize mitigating potential harm arising from AI deployments. These harms can be financial (impacting businesses), economic (affecting policy and markets), or social (impacting individuals and communities). Building a comprehensive enterprise AI governance framework starts with identifying and assessing the full spectrum of risks, and then implementing controls designed to address them appropriately and responsibly.

One significant risk arises when generative AI systems—particularly large language models (LLMs)—produce different outputs for the same prompt. In agentic AI and broader automation workflows, this variability can lead to inconsistent and unpredictable results each time a process runs, potentially creating downstream harm.

Such inconsistency also undermines auditability. If decisions or automated actions powered by LLMs cannot be reliably reproduced, it becomes difficult to validate outcomes, investigate issues, or demonstrate control effectiveness.

So, does this mean your LLM-driven automation initiative is destined to fail? Let’s take a closer look.

First-generation AI governance

Regulatory scrutiny has largely shaped today’s AI governance efforts, with a strong focus on compliance around privacy and security. The emphasis on privacy is logical: AI systems depend on vast volumes of data for training, much of which includes—or can be connected to—personal information. The focus on security is equally justified, as AI platforms introduce new attack surfaces, novel vulnerabilities, and heightened systemic risks that organizations must proactively manage.

Second-generation AI governance

Beyond privacy and security, new drivers of AI governance are gaining prominence. These include ethics, fairness, accountability, and transparency—the core principles of sound governance—along with data quality and environmental sustainability, particularly the energy demands of AI data centers.

At the same time, regulators are expanding expectations around operational risk management and resilience. In the banking sector, for example, Canadian financial institutions must align with requirements outlined in the Office of the Superintendent of Financial Institutions’ Guideline E-21 on operational resilience.

Nascent AI governance: Ensuring reproducibility

Generative AI does not consistently produce the same output for the same prompt. Submit a prompt once and you may receive one response; submit it again and the output can differ. While this variability is valuable for creative applications, it poses challenges for business processes that rely on repeatability—especially where auditability and control validation are essential.

The reproducibility limitations of generative AI arise from multiple underlying factors. A clear understanding of how these systems function is therefore critical when designing governance frameworks, risk management controls, and audit mechanisms during the deployment phase of an AI initiative.

How LLMs are created and how prompts work

At the usage stage of an AI deployment, governance professionals should be mindful of several critical considerations:

  • Veracity – The accuracy, reliability, and truthfulness of AI-generated outputs.
  • Model and data drift – Performance degradation over time as data patterns shift or operating conditions change.
  • Prompt interpretation – Variability in how the model understands and responds to different phrasing or context.
  • Neural network pathways – The probabilistic activation paths within the model that influence output variability.
  • Tunable transformer models – The impact of configurable parameters and fine-tuning choices on system behavior and outcomes.

The development of LLMs begins with aggregating vast datasets that span diverse formats, sources, and subject areas. From a data quality standpoint, teams can screen for duplicates, formatting issues, and general relevance. However, verifying the accuracy and truthfulness of data at such scale is far more challenging.

This raises an early governance concern: veracity. When the underlying training data cannot be fully validated, the reliability of model outputs becomes uncertain. In practice, numerous documented cases have shown AI systems generating inaccurate or entirely fabricated responses—highlighting why data integrity must be a core governance priority from the outset.

Once collected, the data is broken down into tokens—individual words, subwords, and punctuation marks—which serve as the building blocks for training the model. These tokens are fed into a neural network architecture, typically a transformer model, that is tuned prior to deployment to produce desired behaviors. Over time, additional data may be incorporated or the model further refined, altering its internal parameters and influencing performance and outputs. This creates a second governance concern: model and data drift. Such drift helps explain why the same prompt can yield different responses as the system evolves.
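To make the tokenization step concrete, here is a deliberately simplified sketch. Real LLMs use learned subword tokenizers (such as byte-pair encoding), not a rule like this; the point is only that text is decomposed into discrete tokens that are then mapped to integer IDs for training.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Illustrative only: split into words and standalone punctuation marks.
    # Production tokenizers learn subword units from data instead.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Reproducibility matters, doesn't it?")
# During training, each token is mapped to an integer ID via a vocabulary.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[t] for t in tokens]
```

Even this toy version shows why phrasing matters downstream: a changed comma or contraction produces a different token sequence, and therefore a different input to the model.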

From the user’s standpoint, everything begins with a prompt. That prompt is similarly tokenized and processed in a way that allows the model to approximate human-like understanding. This introduces a third governance risk: prompt interpretation. Because natural language is inherently ambiguous, subtle differences in phrasing, context, or intent can lead to unexpected or inconsistent outputs.

Drawing on its vast training data, the neural network generates responses by interpreting the prompt and predicting the most probable next tokens in sequence, assembling them into a coherent output. This probabilistic process introduces a fourth governance concern: neural network pathways. Because the model can follow different internal activation pathways when generating responses, outputs may vary from one execution to another. In practice, this becomes evident when users ask the AI to “try again” or provide an alternative answer and receive a different response each time.
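A toy model of this sampling step makes the variability concrete. The token names and probabilities below are hypothetical, not from any real model; the sketch only shows that sampling from a next-token distribution can diverge between runs, while greedy decoding (always taking the most probable token) is repeatable.

```python
import random

# Hypothetical next-token distribution for an automation decision step.
next_token_probs = {"approve": 0.55, "escalate": 0.30, "reject": 0.15}

def sample_next_token(probs: dict[str, float], rng: random.Random) -> str:
    # Weighted random choice: different random states can pick different tokens.
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Two executions with different internal states may diverge:
run1 = sample_next_token(next_token_probs, random.Random(1))
run2 = sample_next_token(next_token_probs, random.Random(7))

# Greedy decoding is repeatable: always the single most probable token.
greedy = max(next_token_probs, key=next_token_probs.get)
```

This is why "try again" can produce a different answer: the underlying distribution has not changed, but the path sampled through it has.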

The complexity deepens because transformer models are inherently tunable. Some can be configured directly through APIs, while others are adjusted exclusively by the vendor. This tunability introduces a fifth governance concern. For example, parameters such as “temperature” influence whether outputs are more deterministic and fact-driven or more varied and creative. It is essential for end users to understand how these settings—or combinations of settings—are configured, particularly when they do not have direct control over them.
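The effect of temperature can be sketched numerically. The scores below are hypothetical; the mechanics (dividing logits by the temperature before converting them to probabilities) follow the standard softmax-with-temperature formulation used by most transformer decoders. Low temperature concentrates probability on the top token, making output more deterministic; high temperature flattens the distribution, making output more varied.

```python
import math

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    # Scale raw scores by 1/temperature, then normalize with softmax.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / z for tok, v in scaled.items()}

logits = {"approve": 2.0, "escalate": 1.0, "reject": 0.5}  # hypothetical scores
cold = apply_temperature(logits, 0.1)  # top token dominates: near-deterministic
hot = apply_temperature(logits, 2.0)   # probabilities flatten: more creative
```

When the vendor controls this setting rather than the end user, governance teams should at minimum document what value is in effect and whether it can change without notice.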

With this foundational understanding, the lack of repeatability in generative AI becomes more understandable—and, to some extent, expected. The more pressing question is what steps can be taken to improve reproducibility in practical deployments.

What to do about reproducibility

There is little that can be done to eliminate the inherently nondeterministic and probabilistic nature of large language models. However, while no solution is foolproof, organizations can take practical steps to reduce variability and improve consistency in LLM-driven process automation and certain agentic AI applications.

The following approaches are largely within the control of end users:

  • Keep prompts clear and unambiguous. Practice strong prompt hygiene by setting context and explicitly defining the role you want the AI to assume. This helps guide the model toward neural pathways most aligned with the intended outcome.
  • Use the same model consistently for the same task. For example, if you rely on GPT-4.1 for a specific workflow, avoid switching models unless necessary. Consistency reduces variability.
  • Retest prompts after model updates. Monitor vendor upgrade notices and validate whether response patterns or behaviors have changed following any modification to the model.
  • Maintain a human in the loop. Periodically review and interpret a representative sample of outputs and automated decisions to ensure quality and appropriateness.
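The "retest prompts after model updates" control above can be operationalized as a small regression harness run against a set of golden prompts. In this sketch the model call is a stubbed placeholder (`call_model` is hypothetical, not a real API); in practice it would invoke your provider's API with a pinned model and low-variability settings.

```python
# Golden cases: (prompt, substring the response is expected to contain).
GOLDEN_CASES = [
    ("Classify this invoice as VALID or INVALID: total matches line items.", "VALID"),
]

def call_model(prompt: str) -> str:
    # Placeholder stub: substitute your real LLM call here.
    return "VALID"

def run_regression(cases: list[tuple[str, str]]) -> list[str]:
    # Return the prompts whose responses no longer contain the expected text.
    failures = []
    for prompt, expected in cases:
        response = call_model(prompt)
        if expected not in response:
            failures.append(prompt)
    return failures

failures = run_regression(GOLDEN_CASES)
```

Running such a harness after every vendor upgrade notice, and archiving the results, also produces the audit evidence that control validation requires.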

The following control is typically less accessible to end users and may require specialized technical support:

  • Adjust response parameters toward determinism. Configure the model—where possible—to use more fact-based, lower-variability settings rather than highly creative ones.
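As one concrete illustration, the parameter names below follow the OpenAI chat-completion API; other vendors expose similar knobs under different names, and some expose none at all to end users. This is a configuration sketch, not a complete API call.

```python
# Settings biased toward determinism (OpenAI-style parameter names assumed).
deterministic_params = {
    "model": "gpt-4.1",   # pin an exact model rather than a floating alias
    "temperature": 0,     # greedy-leaning decoding, lower variability
    "top_p": 1,           # avoid combining aggressive top_p with temperature
    "seed": 12345,        # best-effort reproducibility where supported
}
```

Even with these settings, vendors generally describe seeded output as best-effort rather than guaranteed, so the human-in-the-loop and regression controls above remain necessary.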

Implementing these measures helps advance explainability—an essential characteristic of any auditable AI system. Explainable AI focuses on making systems transparent and understandable to humans, ensuring that their outputs can be interpreted, justified, and evaluated in a meaningful way.

Conclusion

If you have concerns about the maturity of your AI governance, a structured, methodical approach can help. Begin by clearly defining policy and intent. Then identify the key risks aligned to your project’s specific objectives, determine appropriate controls with clearly assigned roles and responsibilities, and formally document the supporting processes.

Where primary risks relate to privacy, security, or ethics, there is substantial public-domain guidance available to support the development of AI frameworks, policies, procedures, and other governance artifacts.

However, if your AI initiative is centered on automation, governance rigor must increase. In these cases, special attention is required to address the nondeterministic behavior of LLMs and to implement safeguards that manage variability, ensure oversight, and preserve auditability.

Although steps can be taken to manage LLM unpredictability, it is not advisable—at least for now—to embed such automation within mission-critical operations. Regulatory expectations around reproducibility and control are stringent, and nondeterministic behavior can quickly become a compliance liability.

That said, the outlook is not entirely bleak. Automation and agentic AI can be deployed effectively in use cases where the consequences of variability are lower and risks are more contained. For governance teams, a solid understanding of how LLMs function is essential. It enables the design of proportionate controls that improve consistency and support a defensible level of reproducibility—ideally sufficient to withstand audit scrutiny.
