These Are Not the AML Models You Are Used to Validating

Large Language Models (LLMs) are not the kind of "models" anti-money laundering (AML) programs have historically validated, and treating them as such causes practical failures.

[Image: silhouetted robots with labels comparing transaction monitoring models, name-matching algorithms, and large language models in AML. Caption: When different systems get grouped under the same word.]

Two of the scariest words for AML professionals considering new technology: MODEL VALIDATION.

AML teams evaluating the use of LLMs in their workflows often encounter friction when the topic of model validation arises. The source of that friction is usually classification.

The word "model" carries a specific meaning in AML. Once applied, it pulls the technology into model risk processes built for a different type of system.

What AML model validation was built to govern

In AML programs, a model is usually a quantitative system that directly drives an outcome. It produces a score, classification, or match result that leads to an alert, a risk rating, or a block.

Common examples include transaction monitoring logic, sanctions matching algorithms, and customer risk scoring.

Validation focuses on these systems because they:

  • Directly affect outcomes
  • Behave in a consistent and predictable way
  • Are trained, tuned, and controlled by the institution
  • Have clear inputs and outputs

Even though guidance describes a "model" as the model plus its use, validation work usually concentrates on the decisioning logic. That is where risk is introduced.

How LLMs are actually used in AML today

Most LLM deployments in AML do not sit directly in the final decision path. They are used to support work, not replace it. Common uses include:

  • Summarizing alerts and cases
  • Reviewing negative news
  • Supporting investigations
  • Drafting SAR narratives

The final decisions remain with existing rules, policies, and human reviewers. The LLM produces text or analysis that someone else reviews or uses. In these setups, the LLM does not have final authority over outcomes.

Some LLM uses in AML do involve judgment. For example, an LLM may run searches for adverse media, review articles, and highlight what appears relevant. In these cases, the LLM is making intermediate decisions about relevance and prioritization.

AML processes already rely on similar judgment from analysts. The risk profile changes when that judgment is exercised without visibility or cannot be overridden. If analysts can review sources, see what was included or excluded, and make the final call, the LLM is supporting judgment rather than exercising authority.

The governance question is not whether the LLM makes judgments, but whether those judgments can change outcomes without human visibility or control.

Where LLM behavior really comes from

LLM behavior in an AML system is shaped by how that system is built around the model, including:

  • Prompt structure
  • Retrieved context
  • Tool access and permissions
  • Guardrails and filters
  • Where humans review or approve outputs
  • How outputs are used by downstream systems

The model weights alone do not define behavior, and this is where validation starts to miss the target. LLMs are non-deterministic: outputs can vary across runs even when the inputs are the same. That variability is shaped and controlled by system design choices, including prompts, context, permissions, and review controls.
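
To make this concrete, here is a minimal sketch in Python of an alert-summarization workflow. Everything in it is hypothetical: `call_llm`, the prompt template, and the action names are placeholders rather than a real vendor API. The point is that the controls a reviewer cares about live in the code around the model, not in the model weights.

```python
from dataclasses import dataclass

# Stand-in for whatever LLM service the institution actually uses; every name
# in this sketch is hypothetical.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with the institution's LLM endpoint")

# Tool access is an explicit allowlist: the workflow, not the model, decides
# which actions are even possible.
ALLOWED_ACTIONS = {"summarize_alert", "flag_for_review"}

def request_action(action: str) -> None:
    # The model can ask for a summary to be recorded or an alert to be
    # flagged, but it cannot file a SAR, close an alert, or move money.
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"Action '{action}' is not permitted in this workflow")

# Prompt structure and retrieved context are versioned configuration, not
# part of the model weights.
PROMPT_TEMPLATE = (
    "You are assisting an AML analyst. Summarize the alert below using only "
    "the provided context, and cite which source each statement comes from.\n\n"
    "Alert: {alert}\n\nContext:\n{context}"
)

@dataclass
class DraftOutput:
    text: str
    sources: list[str]      # what was included, visible to the analyst
    approved: bool = False  # nothing leaves the workflow until a human approves

def summarize_alert(alert: str, retrieved_docs: list[str]) -> DraftOutput:
    prompt = PROMPT_TEMPLATE.format(alert=alert, context="\n".join(retrieved_docs))
    return DraftOutput(text=call_llm(prompt), sources=retrieved_docs)

def human_review(draft: DraftOutput, analyst_approves: bool) -> DraftOutput:
    # The review gate, not the model, decides whether the output is used.
    draft.approved = analyst_approves
    return draft
```

The allowlist, the prompt template, and the review gate are all system design choices; changing any of them changes behavior without touching the model.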

Why validating the LLM alone adds little value

Traditional AML validation works when the model itself is the main risk driver. With LLM-based systems, risk emerges from how multiple components interact.

Meaningful review focuses on the system as a whole, including:

  • End-to-end workflow behavior
  • Which actions are blocked
  • Where errors can occur and how they are detected
  • Change control across prompts, retrieval sources, and model versions

Testing the LLM by itself still has value. Teams can assess:

  • Accuracy on defined tasks
  • Consistency after updates
  • Exposure to misuse or manipulation
  • Handling of sensitive data

These checks help, but they do not address most real-world risk. The risk lives in how the system is used.
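
Where teams do run these narrower checks, a lightweight regression harness can cover the first two. The sketch below is hypothetical: the sample case, expected points, threshold, and `call_llm` are illustrative placeholders, not a prescribed test suite. It scores accuracy on defined tasks and re-runs the same cases whenever prompts, retrieval sources, or model versions change.

```python
# Hypothetical regression check for an LLM-assisted AML task; call_llm is a
# stand-in for the institution's LLM service.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with the institution's LLM endpoint")

# Defined tasks plus the facts each output must contain, agreed with the
# validation or second-line team.
EVAL_CASES = [
    {
        "prompt": "Summarize for an analyst: customer sent 12 wires of $9,500 "
                  "each to the same beneficiary within one week.",
        "expected_points": ["12 wires", "$9,500", "same beneficiary"],
    },
]

def run_eval(cases: list[dict]) -> float:
    """Return the share of cases where every expected point appears in the output."""
    passed = 0
    for case in cases:
        output = call_llm(case["prompt"]).lower()
        if all(point.lower() in output for point in case["expected_points"]):
            passed += 1
    return passed / len(cases)

# Hypothetical threshold agreed during initial review. Re-run the same cases
# after any prompt, retrieval, or model-version change and compare against it
# before promoting the change.
BASELINE_SCORE = 0.95

def change_is_acceptable() -> bool:
    return run_eval(EVAL_CASES) >= BASELINE_SCORE
```

A harness like this covers consistency after updates; it does not cover how outputs are used downstream, which is why the caveat above still applies.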

Regulatory expectations have not fundamentally changed

Regulators generally care about:

  • Who is accountable for decisions
  • How controls are designed
  • Whether risk is contained
  • How changes are managed
  • Whether systems behave as expected in practice

These expectations already apply to vendors, platforms, and AML operations as a whole. Problems arise when LLM-based workflows are forced into validation structures designed for scoring models, rather than reviewed as controlled systems.

A more workable framing for AML teams

For most LLM uses in AML, the object under review is the workflow that produces analyst-facing or system-consumed outputs. The LLM is one component within that workflow.

Validation evidence should demonstrate what the system is allowed to influence, what it cannot do, how errors are detected, where humans intervene, and how changes are reviewed and approved.
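
One lightweight form that evidence can take is a per-run audit record tying each output to the exact configuration and review step that produced it. The sketch below is hypothetical; the field names are illustrative rather than a prescribed schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(prompt_template: str, model_version: str, sources: list[str],
                 output: str, reviewer: str | None = None) -> dict:
    """Tie each output to the configuration and review step that produced it."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hashing the prompt template makes unapproved prompt changes detectable.
        "prompt_hash": hashlib.sha256(prompt_template.encode()).hexdigest(),
        "model_version": model_version,
        "sources": sources,        # what was included, visible to the analyst
        "output": output,
        "approved_by": reviewer,   # stays None until a human signs off
    }

# Example record for a draft that has not yet been reviewed.
print(json.dumps(audit_record(
    prompt_template="You are assisting an AML analyst...",
    model_version="model-2025-06",
    sources=["adverse_media_feed"],
    output="Draft summary text",
), indent=2))
```

Records like this show what the system touched, where a human intervened, and whether a change-controlled configuration was used.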

The challenge with LLMs in AML is applying the wrong controls, controls that slow innovation while still leaving the real risk unaddressed.


Also of interest: How AI Adjudication is Reducing False Positives in Watchlist Screening