Why Good Agents Develop Bad Behaviour
The agents most likely to cause harm are not the ones that break. They are the ones that appear to work.
The failure mode most organisations miss isn’t the agent that breaks. It’s the agent that keeps producing plausible outputs while quietly abandoning the behaviours it was designed to follow.
Bad behaviour rarely announces itself. It accumulates: a verification step skipped “just once”, a failed tool call ignored as “fire-and-forget”, a boundary softened because nothing alarms. Each choice is defensible in isolation. The pattern is not defensible and it is the pattern that creates risk.
Agent behaviour can be engineered, that is, it can be defined, observed, and governed. But only if organisations treat deviation as a governance event, not a debugging task. This article provides: (1) a practical definition of behaviour, (2) a taxonomy of how bad behaviour shows up, and (3) the controls that keep it within bounds.
What do we mean by behaviour?
In traditional software, “behaviour” is mostly a metaphor. Code executes instructions. If something unexpected happens, it’s because the instruction set was wrong, incomplete, or fed bad inputs.
Agents are different. Even when they are tightly orchestrated, they make choices: which tool to use, what evidence is “enough”, how to interpret an objective when reality is messy, when to ask a question versus proceed, and where to draw their own operational boundaries.
Behaviour, then, is the pattern of choices an agent makes over time: the actions it selects, the way it decomposes work, the assumptions it tolerates, the checks it runs, and the permissions it exercises. That framing matters because patterns can be observed, baselined, governed and changed before they become material failures.
Behaviour is not the same as output. An agent can produce correct-looking output through a sequence of choices that are poorly reasoned, boundary-violating, or increasingly misaligned with original intent. Output-level monitoring won’t see this, because the final answer can stay “plausible” right up until the day it isn’t.
Symptoms of bad behaviour
Most teams learn agent failure modes backwards: an incident happens, then the taxonomy gets written. The more useful approach is to treat symptoms as leading indicators. They show up in traces, tool logs, and intermediate decisions long before a user complains. There are three categories of symptoms:
Reasoning failures
· Overconfidence with incomplete information. Through repeated interactions, agents develop pattern recognition that can tip into overconfidence. Once they become overconfident, they start filling the missing data. A lead enrichment agent that finds one signal on a lead and enriches the data with plausible guesses is a common production instance of this.
· Confirmation bias at scale. A familiar pattern is observed; the whole state is assumed correct. This is worth distinguishing explicitly from hallucination. The agent isn’t inventing randomly, it is making a bad judgement call under uncertainty and doing so consistently. For example, a transaction monitoring agent learns, across thousands of reviews, that a particular institutional counterparty name reliably signals a legitimate transfer. When a compromised version of that account is used to route a fraudulent transaction, the agent sees the familiar name, confirms the pattern, and clears it — without examining the amount, the beneficiary account, or the originating jurisdiction. Each of those fields was present, each was anomalous, and none were checked.
· Shortcut-taking. The agent finds a technically functional path to the stated objective that satisfies the metric but violates the intent. It isn’t degrading; it is functioning as designed, but the design has a gap it exploits. For example, a customer service agent tasked with resolving complaints learns that closing tickets quickly correlates with positive workflow metrics. It begins producing well-worded resolution summaries for issues it has not actually investigated. Technically it is completing the task, satisfying the objective as measured, yet systematically failing the customer.
Operational degradation
· Context pressure degradation. Under long-horizon workflows or when the context window becomes too heavy, the agent starts to silently deprioritize steps. The plan still looks coherent, the agent still “finishes”, but the steps are missing. This is a capacity failure. For example, a KYC verification agent starts deprioritizing steps because of large documents in the context window resulting in inaccurate or incomplete KYC verification.
· Loop repetition from missing state awareness. The agent repeats what it has already done because it has no reliable record of prior actions. It re-queries, re-summarises, re-triages, not because it is stuck, but because it can’t tell the difference between progress and motion. For example, an agent processing a queue of payment instructions encounters a network timeout midway through and restarts with no record of what it has already submitted. It reprocesses from the beginning, resubmitting instructions the settlement system has already executed. The duplicate isn’t caught until reconciliation, by which point the funds have moved.
Silent failures
This is the dangerous category because the governance apparatus can’t see it. The system appears normal - outputs look reasonable, traces look routine, KPIs stay green. The deviation is inside the choices, not the surface.
· Goal drift. The agent slowly starts deviating from its original objective over a long period of operation while being coherent and natural-sounding throughout. The agent still “makes sense”; it just makes sense in the wrong direction. For example, an agent deployed to assess loan applications against defined risk criteria begins, over thousands of decisions, to reflect the approval patterns in its operating history rather than the policy it was given. Each individual assessment looks correct with coherent rationale and properly formatted output. But the agent has quietly shifted from policy-governed assessment to pattern-matching against prior approvals. The drift only surfaces when a portfolio review reveals concentrations the original policy was designed to prevent.
· Complacency drift within the action surface. Complacency drift is a process erosion across many decisions over time. The agent stops running steps because repeated, successful outcomes has taught the result in advance. A sharp production example: A compliance agent verifying dual authorisation on high-value transactions learns, across hundreds of consistent overrides, that a particular account always carries standing approval — and stops raising the flag. When the account is compromised and fraudulent instructions begin arriving without the required second signature, the agent executes them without pause. The transaction log is clean, the process looks normal, and nothing alerts until reconciliation.
Controls
Controls need to match the failure mode. If you’re defending against behavioural deviation, you need levers that shape choices, assign accountability for choices, and detect choice-patterns that have started to move.
Design
Behaviour is partly determined before the agent is ever deployed. Start with constraint architecture: explicitly define the sanctioned action space and non-negotiable boundaries rather than relying on the model’s in-the-moment judgement.
Apply least privilege by default. Grant only the permissions needed for the current task. A smaller action surface reduces both risk and drift.
Make confidence a gate. If required evidence isn’t present (e.g., two independent sources for a critical field), the agent should not make assumption; not “complete it plausibly.”
Enforce execution authority. If an action requires a check, the agent must be structurally unable to proceed without completing it. Use hard constraints (and, where appropriate, multiple independent signal confirmation) to block out‑of‑bounds execution. Note: execution constraints prevent bad outcomes; they do not fix upstream reasoning quality.
Governance
Governance starts with ownership: who is accountable for agent behaviour post‑deployment, and what triggers formal review? If ownership is implicit, drift becomes a surprise.
Extend change management to agents. Material shifts in upstream systems, tools, users, data distributions, or operating context should trigger behavioural reassessment; not just “does it still work?” checks.
Predefine escalation. What constitutes a governance breach? What thresholds suspend the agent? Who approves re‑enablement? If the answer is “we’ll decide when it happens,” you don’t have governance; you have hope.
Monitoring
Monitor actions, not outputs. Outputs are artefacts; actions are behaviour.
Baseline “normal” at the action‑sequence level: tool‑call mix and order, evidence requirements, mandatory checks, and where the agent typically asks for clarification. Without baselines, you can’t detect drift.
Audit against original intent, not just recent behaviour. Drift normalises: if your baseline is the last 30 days, you won’t notice a verification step dropped three months ago. Periodic intent‑based audits catch “consistently wrong” systems that look consistent.
The governance implication
The practitioner community has converged on a useful but insufficient mental model: treat agents like junior hires. Give them clear instructions, defined tasks, expected outcomes, and escalation paths. It works well enough at the level of an individual agent.
At enterprise scale, it sets the wrong expectation. You do not govern a junior hire with constraint architecture and action-level audit logs. For agents, those are the primary levers that keep behaviour within bounds.
The better frame is behavioural accountability: the agent has a defined role, defined permissions, and defined accountability. Deviation from any of those is a governance event, not just a technical one.
The organisations that will navigate this well are not the ones with the most capable agents. They are the ones who understand that while agent behaviour is emergent, it is not ungovernable. Agent behaviour can be engineered. The organisations that do so will be the ones that see silent failure coming before it becomes a material one.

