Shipping LLM Agents in Regulated Science
Everyone’s demo works. The distance between a demo and a system a quality organization will sign off on is the entire job — and it is not where most teams are looking.
A vendor demo dazzles a room of scientists in twenty minutes. Eighteen months later the same tool still is not in a validated workflow, and the model did not get worse in the meantime. The demo answered the wrong question. It proved the system can produce the answer. Deployment asks something narrower and far harder: can you defend that answer to an auditor two years from now, when the batch it informed is already in patients?
I have spent the last several years building production AI for pharma and biotech — document AI, Digital QC, microscopy, cell-line and ADMET work — on top of harmonized scientific data. The single most consistent lesson is that in regulated science the engineering problem inverts. General agent engineering optimizes for capability and autonomy: make the model smarter, give it more tools, let it act. Regulated science keeps those on the table but subordinates them to a different stack of constraints — provenance, determinism, validation, and an unbroken chain of human accountability. The hard part was never making the agent smarter. It is making it auditable, reproducible, and bounded enough that a named human can put their signature under its output.
Here is what that actually changes, constraint by constraint.
The deliverable is not an answer — it is a defensible answer
In consumer AI, a wrong answer is a bad experience. In QC review, submission analysis, or batch release, a wrong answer is a deviation: a regulatory event with a paper trail and consequences that escalate toward the patient. The cost function is not just higher; it is asymmetric. A confidently wrong answer is far more expensive than an abstention.
That reframes hallucination entirely. It is not an annoyance to be reduced by a few points on a benchmark; it is a failure mode to be engineered against structurally. Grounding, citation to source, and end-to-end traceability stop being polish and become the product. You are not really shipping a generator. You are shipping an evidence chain that happens to be assembled by a model, and you build the rest of the chain to carry the weight the model cannot.
Validation, not evaluation
This is the gap most teams coming from a pure ML background underestimate. ML culture says: measure on a held-out benchmark, ship, monitor, iterate. Regulated science says: qualify the system before it touches a GxP workflow — computer system validation under a framework like GAMP 5, with installation, operational, and performance qualification (IQ/OQ/PQ) documented and signed.
The genuine tension is obvious once you state it. How do you validate a system that, by design, can give a different output to the same input? The answer is that you do not validate the generation. You validate the envelope: pinned model versions, deterministic decoding settings, frozen prompts, a bounded and whitelisted set of tools, retrieval over a controlled corpus, and guardrails whose behavior is deterministic and therefore testable. The creative core stays probabilistic; everything around it is made boring on purpose. In regulated AI the model is the part you constrain, not the part you trust.
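One way to picture the envelope is as a frozen configuration object plus a guardrail that is a pure function of its inputs. The sketch below is illustrative only: the `Envelope` class, the `check_call` function, the tool names, and the model identifier are all hypothetical, and a real system would qualify far more than this.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Envelope:
    """Hypothetical description of the qualified envelope around a model."""
    model_id: str             # pinned, dated version, never a rolling alias
    temperature: float        # decoding setting fixed at qualification time
    prompt_sha256: str        # hash of the frozen system prompt
    allowed_tools: frozenset  # bounded allowlist of tool names


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_call(env: Envelope, prompt: str, tool: str) -> tuple[bool, str]:
    """Deterministic guardrail: the same inputs always give the same verdict."""
    if sha256_text(prompt) != env.prompt_sha256:
        return False, "prompt differs from the qualified version"
    if tool not in env.allowed_tools:
        return False, f"tool '{tool}' is not on the allowlist"
    return True, "ok"


# Placeholder values, assumed for illustration only.
PROMPT = "You are a QC review assistant. Cite a source for every claim."
ENVELOPE = Envelope(
    model_id="example-model-2025-01-01",
    temperature=0.0,
    prompt_sha256=sha256_text(PROMPT),
    allowed_tools=frozenset({"search_sops", "read_batch_record"}),
)
```

Because `check_call` has no randomness and no hidden state, it can be covered by ordinary test cases in an OQ protocol, which is exactly what the generation step cannot offer.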
The audit trail is a first-class output
21 CFR Part 11 and the data-integrity principles behind it — ALCOA+: attributable, legible, contemporaneous, original, accurate, plus complete, consistent, enduring, available — were not written with agents in mind. But agents now act inside the very systems those rules govern. Every action an agent takes has to be attributable and reconstructable after the fact. “Who decided this?” must have an answer, and “the model did” is not an acceptable one.
This is why human-in-the-loop in regulated science is a legal control, not a UX preference. The review step, the sign-off, the escalation path — these are part of the system architecture, designed in from the start, not a confirmation dialog bolted on at the end. The agent proposes; an accountable human disposes; and the trail records both, in a form that survives the person, the project, and the vendor.
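A minimal sketch of what "the trail records both" can look like in practice. The hash chaining used here is my own choice for making edits detectable, not something Part 11 prescribes, and the actor names, roles, and payload fields are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # "previous hash" value for the first entry


def _digest(body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(trail: list, actor: str, role: str, action: str, payload: dict) -> dict:
    """Append an attributable, timestamped entry linked to the previous one."""
    body = {
        "actor": actor,
        "role": role,
        "action": action,
        "payload": payload,
        "at": datetime.now(timezone.utc).isoformat(),
        "prev": trail[-1]["hash"] if trail else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    trail.append(entry)
    return entry


def verify(trail: list) -> bool:
    """Recompute every hash and link; any edited entry breaks the chain."""
    prev = GENESIS
    for entry in trail:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# The agent proposes; a named human disposes; both are on the record.
trail = []
append_entry(trail, "agent:qc-reviewer", "agent", "propose",
             {"finding": "assay result within specification", "source": "doc-123"})
append_entry(trail, "user:jane.doe", "qa_reviewer", "approve",
             {"proposal_hash": trail[0]["hash"]})
```

The approval entry points at the hash of the exact proposal it approves, so the question "who decided this?" resolves to a person and to the specific content they signed off on.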
“I don’t know” is a feature you have to build
The highest-leverage design decision in this whole space is making abstention a first-class output. Under an asymmetric cost regime, an agent that reliably knows the edge of its own competence is worth more than one that is a couple of points more accurate on average and fails silently when it leaves familiar ground.
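The arithmetic behind that claim is short. The sketch below uses invented numbers: a wrong answer is assumed to cost 100 units and a human review 1 unit, and both function names are hypothetical. Real costs would come from your own deviation and review data.

```python
def expected_cost(coverage: float, accuracy: float,
                  cost_wrong: float, cost_review: float) -> float:
    """Expected cost per item for an agent that answers a `coverage`
    fraction of items and sends the rest to human review."""
    wrong = coverage * (1.0 - accuracy) * cost_wrong
    reviewed = (1.0 - coverage) * cost_review
    return wrong + reviewed


def min_confidence_to_answer(cost_wrong: float, cost_review: float) -> float:
    """Answering beats abstaining only when (1 - p) * cost_wrong < cost_review,
    which rearranges to p > 1 - cost_review / cost_wrong."""
    return 1.0 - cost_review / cost_wrong


COST_WRONG = 100.0   # assumed cost of a confidently wrong answer
COST_REVIEW = 1.0    # assumed cost of routing one item to a human

# Agent that answers everything at 97% accuracy.
always_answers = expected_cost(1.0, 0.97, COST_WRONG, COST_REVIEW)
# Agent that answers 90% of items at 99% accuracy and abstains on the rest.
abstains = expected_cost(0.90, 0.99, COST_WRONG, COST_REVIEW)
threshold = min_confidence_to_answer(COST_WRONG, COST_REVIEW)
```

Under these assumed costs the always-answering agent costs about 3 units per item and the abstaining one about 1, and the break-even confidence for answering at all is about 0.99. The steeper the asymmetry, the higher that bar moves.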
Concretely, that means retrieval grounding that refuses when the evidence is not present rather than improvising; structured outputs in place of free text wherever a schema can carry the meaning; confidence thresholds that route to a human instead of guessing; and treating “needs review” as a successful terminal state, not a failure. Teams arriving from consumer AI most often have this instinct backwards, because there abstention reads as the model giving up. In a GxP context it is the model behaving correctly.
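Those four ideas fit in a few lines. This is a sketch under assumptions: the `route` function and its field names are hypothetical, the confidence score is assumed to come from an upstream component that has been calibrated, and the 0.99 default is a placeholder to be set during qualification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ANSWERED = "answered"
    NEEDS_REVIEW = "needs_review"  # a successful terminal state, not an error


@dataclass(frozen=True)
class Result:
    """Structured output: a schema carries the meaning, not free text."""
    outcome: Outcome
    answer: Optional[str]
    citations: tuple
    reason: str


def route(answer: str, citations: list, confidence: float,
          threshold: float = 0.99) -> Result:
    # Refuse when retrieval found no evidence, rather than improvising.
    if not citations:
        return Result(Outcome.NEEDS_REVIEW, None, (),
                      "no supporting evidence retrieved")
    # Below the threshold, route to a human instead of guessing.
    if confidence < threshold:
        return Result(Outcome.NEEDS_REVIEW, None, tuple(citations),
                      f"confidence {confidence:.3f} below threshold {threshold:.3f}")
    return Result(Outcome.ANSWERED, answer, tuple(citations),
                  "grounded and above threshold")
```

Note that the abstaining branches return a normal `Result` rather than raising an exception. Downstream code handles `NEEDS_REVIEW` as an expected path, and the withheld answer never leaks into the output.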
Change control versus “continuous improvement”
The ML reflex is to improve quietly: tweak a prompt, swap in a better model, ship on Friday. Inside a validated system, each of those is a change-control event — potentially triggering revalidation. The convenience that makes modern AI fast to iterate is in direct tension with the governance that makes it deployable, and pretending otherwise is how pilots quietly die in month nine.
So you architect for it. Version everything — model, prompt, tool definitions, retrieval index — as a single qualified configuration. Freeze the model behind a contract rather than riding a vendor’s rolling endpoint. Demand explicit no-training-on-your-data guarantees. None of this is hypothetical: the FDA’s own Claude-based assistant, Elsa, runs in a FedRAMP High environment with an explicit no-training-on-industry-data guarantee. That posture is not the cautious exception in regulated AI; it is the template.
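Treating the configuration as one qualified unit can be as simple as fingerprinting it. The sketch below assumes hypothetical component names and version labels; the point is only that any drift in any component changes the fingerprint and names the component that moved.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable hash of the whole configuration; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_components(qualified: dict, running: dict) -> list:
    """Names of components that differ from the qualified baseline."""
    keys = sorted(set(qualified) | set(running))
    return [k for k in keys if qualified.get(k) != running.get(k)]


# Placeholder version labels, assumed for illustration only.
QUALIFIED = {
    "model": "example-model-2025-01-01",
    "prompt_version": "qc-review-v7",
    "tools_version": "tools-v3",
    "retrieval_index": "sop-index-2025-01-15",
}
QUALIFIED_FINGERPRINT = fingerprint(QUALIFIED)

# Someone "just tweaked the prompt" on a Friday.
RUNNING = {**QUALIFIED, "prompt_version": "qc-review-v8"}

needs_change_control = fingerprint(RUNNING) != QUALIFIED_FINGERPRINT
```

Here `changed_components(QUALIFIED, RUNNING)` returns `["prompt_version"]`, which is the input a change-control record needs. Checking the fingerprint at startup turns a silent tweak into a visible event.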
The agent is only as trustworthy as the lineage beneath it
This is the part the model-centric conversation keeps skipping. The limiting reagent in regulated AI is rarely model quality; it is harmonized, lineage-tracked data. An agent reasoning over fragmented, un-provenanced instrument output cannot produce a defensible answer no matter how capable the model is — because the defense was always going to rest on the data’s provenance, and there is none to point to.
The adoption gradient that industry surveys keep reporting tells the same story from the other side: AI uptake is steep for “read this for me” tasks and shallow for “design this for me,” and every CIO already knows data fragmentation is the real bottleneck. The agent layer is downstream of the data-integrity layer. Build the data-integrity layer first, or the agent layer has nothing to stand on.
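In code, "downstream of the data-integrity layer" can mean a gate that sits in front of the agent and only passes records with complete provenance. This is a minimal sketch: the required field names and the record layout are assumptions, not a standard.

```python
# Hypothetical provenance fields a record must carry before an agent may use it.
REQUIRED_LINEAGE = ("instrument_id", "method_version", "acquired_at", "raw_file_sha256")


def missing_lineage(record: dict) -> list:
    """Required provenance fields that are absent or empty on this record."""
    return [field for field in REQUIRED_LINEAGE if not record.get(field)]


def lineage_gate(records: list) -> tuple:
    """Split records into those the agent may reason over and those it may not."""
    accepted, rejected = [], []
    for record in records:
        gaps = missing_lineage(record)
        if gaps:
            rejected.append((record.get("record_id"), gaps))
        else:
            accepted.append(record)
    return accepted, rejected


records = [
    {"record_id": "r1", "instrument_id": "hplc-02", "method_version": "m-4.1",
     "acquired_at": "2025-01-15T09:30:00Z", "raw_file_sha256": "a" * 64,
     "value": 99.2},
    {"record_id": "r2", "instrument_id": "hplc-02", "method_version": None,
     "acquired_at": "2025-01-15T09:45:00Z", "raw_file_sha256": "b" * 64,
     "value": 98.7},
]
accepted, rejected = lineage_gate(records)
```

The second record has a perfectly plausible value and is still excluded, because nobody could later say which method version produced it. The rejection list is itself useful output: it tells the data team exactly where the lineage is broken.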
Why this is familiar territory
None of this discipline is new to me, and that is the point I would make to anyone weighing whether the skill set transfers. Before pharma I built a real-time brain–machine interface that had to make a decoding decision in roughly a millisecond, where a wrong call at the wrong moment silently corrupted an experiment that took months to set up. You learn very quickly to engineer for the failure mode rather than the demo: to instrument everything, to make the system’s own uncertainty legible, and to design the boundaries before the capabilities. Regulated AI is the same engineering discipline in a different jurisdiction. The actuator changed from a laser to a batch record; the rule that the cost of being wrong has to be designed for from the start did not.
- Regulated-AI systems thinking — CSV / GAMP 5, 21 CFR Part 11, ALCOA+
- Production ML under asymmetric cost, with abstention as a design primitive
- Data-lineage and provenance architecture as the foundation, not an afterthought
- End-to-end delivery — from real-time C++ systems to deployed scientific AI
The real moat
The frontier model is becoming a commodity; capability is converging and will keep converging. The teams that win in regulated science will not be the ones with a marginally better model. They will be the ones who treated provenance, validation, and abstention as the architecture rather than the afterthought — who understood that in this domain the boundary is the product. Everyone’s demo works. The work begins where the demo ends.
Written by Igor Gridchyn, PhD — Senior Applied AI/ML Engineer building production scientific AI for pharma and biotech, and author of the first brain–machine interface for assembly-specific memory disruption (Neuron, 2020). Ongoing analysis of applied AI in the sector lives at the AI in Pharma & Biotech newsletter. For hiring, collaboration, or a deeper technical conversation, the contact form is the cleanest path.