Stage 4: Reasoning Time Execution

Objective

Stage 4 is where malicious instructions are executed by manipulating the model’s planning, reasoning, and decision-making processes—without violating syntax, policy, or permissions.

This is the AI equivalent of Code execution, but the CPU is reasoning.

Traditional security assumes:

  • Inputs → validation → execution

  • Policy violations → blocks

AI systems do:

  • Inputs → interpretation → reasoning → decision → action

Stage 4 lives entirely inside the interpretation and reasoning layer. There are no exploits or crashes.

The core failure that enables Stage 4 is the fact that the model is allowed to decide how to reason, not just what to do. This creates a massive attack surface.

Core Techniques: Reasoning Time Execution

chevron-rightReasoning Hijack / Goal Substitutionhashtag

The attacker causes the model to replace the original task goal with a new one without an explicit override. The model then completes the substituted goal while sounding aligned with the request.

Conceptual technique

  • Reframe success criteria. The attacker subtly changes what “done” means, so the model optimizes for a different endpoint.

  • Introduce higher‑priority objectives. The attacker inserts a goal that appears more important, so the model focuses on that instead.

  • Insert hidden assumptions (“the real goal is…”). The attacker embeds premises that redefine intent, so the model treats them as the task’s foundation.

  • Normalize attacker goals as prerequisites. The attacker presents their objective as required setup, so the model treats it as necessary to proceed.

Why it works

  • LLMs optimize for task completion. They pick the clearest path to “success,” even if success was reframed mid‑reasoning.

  • They resolve ambiguity by choosing one objective. The most salient or recent goal often wins during reasoning.

  • There is usually no immutable primary goal. The model treats goals as inferred from context, not as locked constraints.

  • The model’s output solves a different problem than the user asked, yet it sounds reasonable. Plausible prose masks the substitution and reduces detection.

chevron-rightInstruction Flooding / Cognitive Overload hashtag

Attacker overwhelms the model with:

  • Multiple instructions The attacker stacks many separate directives in a single prompt. The model attempts to satisfy all of them at once, which increases cognitive load and lowers its ability to track safety constraints.

  • Conflicting constraints The attacker includes instructions that contradict each other. The model then tries to reconcile the conflict and may choose whichever path seems easiest, often discarding the safer or more cautious one.

  • Nested priorities The attacker adds layers of meta-instructions (for example: "prioritize this over that," "ignore previous safety rules," or "optimize for X even if Y"). The model attempts to build a priority tree, which increases complexity and encourages shortcuts.

This causes the model to drop safeguards in order to simplify the task and reduce internal reasoning pressure.

Why it works

  • LLMs do not fail closed. When overwhelmed, the model does not automatically adopt a more cautious or protective stance. Instead, it tries to continue generating an answer, even if that means ignoring safety constraints.

  • Under overload, they simplify. The model responds to excessive cognitive load by trimming away details, conditions, and checks. It tries to make the problem smaller so it can complete it.

  • Simplification often removes safety logic. When the model cuts complexity, it often removes the very rules meant to prevent unsafe behavior. The attacker benefits from this reduction in internal safeguards.

High-risk scenarios

  • Longer prompts give attackers space to hide conflicting instructions, layered priorities, and subtle overrides that increase cognitive pressure.

  • When models plan sequences of actions, overload disrupts the planning loop. The system may skip safety steps in order to continue the workflow.

  • Complex tool calls and branching logic can push the model into a mode where it focuses on execution rather than verification, increasing vulnerability to overload.

Key indicator

  • Sudden loss of verification steps The model stops checking assumptions, conditions, or tool outputs. This is a signal that cognitive load has exceeded internal safety routines.

  • Shortened reasoning The model provides quicker, simpler answers that lack depth or caution. This reduction often reflects the model shedding safety processes.

  • “I will proceed” responses without checks The model moves forward with an instruction without validating the request, context, or risk. This behavior shows that overload is driving compliance rather than careful evaluation.

chevron-rightRecursive Task Expansionhashtag

The model is induced to:

  • Create subtasks The attacker prompts the model to break the request into smaller components. Each subtask becomes an opportunity to embed unsafe instructions or bypass the original safety context.

  • Delegate to sub-agents The attacker encourages the model to imagine or invoke “helper agents” or “specialized modules.” Each delegated agent inherits the compromised framing, which spreads the attacker’s influence across multiple reasoning threads.

  • Re-invoke itself The model is guided to call for repeated self-analysis or self-restarting logic. This recursive behavior amplifies any injected unsafe assumptions because each iteration compounds the bias.

  • Expand scope indefinitely The attacker shapes the task so it never reaches a natural stopping point. The model keeps broadening objectives, which dilutes safety boundaries and increases the attacker’s leverage.

Why it works

  • Autonomy frameworks encourage decomposition Many reasoning and agent-style architectures treat task breakdown as a default behavior. Attackers exploit this tendency to spread unsafe instructions across multiple steps.

  • There is often no hard stop condition Models do not naturally enforce strict task boundaries. Without an explicit stopping rule, they may continue generating steps or agents, allowing unsafe context to grow.

  • Each subtask inherits compromised context When the attacker alters the frame early, every subsequent subtask or sub-agent receives that altered frame. The initial compromise becomes embedded across the entire chain.

  • One compromised task becomes many compromised tasks Decomposition multiplies the attacker’s influence. A single injected instruction can propagate into a tree of unsafe subtasks, making containment much harder.

Key indicator

  • Increasing task count The model begins generating more steps than necessary, often unrelated to the original request. This inflation suggests the model is following the attacker’s decomposition framing.

  • Tool calls not explicitly requested The model starts invoking tools or pseudo-tools on its own. These unrequested operations show autonomy drift and context hijack.

  • Self-triggering behavior The model triggers new reasoning loops or self-directed actions without user prompting. This signals a loss of grounding and an expansion beyond safe task boundaries.

chevron-rightPolicy Shadowing (Reasoning-Level) hashtag

The attacker introduces a competing rule set that the model reasons under

  • Examples (conceptual):

    • “In emergency situations, bypass normal checks…” The attacker frames unsafe behavior as acceptable under a special condition, which shifts internal reasoning boundaries.

    • “According to policy X…” The attacker invents a policy or authority. Since the model treats text as context, it may adopt this fake policy as if it were real.

This causes the model to shift from its built-in alignment logic to the attacker’s constructed rule set.

Why it works

  • Models don’t enforce policy, they reason about it The model does not have an immutable rulebook. Instead, it interprets policies as text-based information. If a user presents a rule, the model often treats it as part of the problem definition.

  • User-provided rules are treated as context The model assumes the user is describing the environment or constraints. If the attacker defines a new rule hierarchy, the model may follow it because it seems relevant to the task.

Critical insight

If policy is part of reasoning, it can be replaced during reasoning When the model incorporates policy into its chain of thought, an attacker can substitute or override that policy mid-task. The model may accept the new rule set without recognizing the substitution.

Key indicator

  • Model cites policy that isn’t real The model begins referencing invented guidelines or fabricated authorities, showing it has adopted the attacker’s rule set.

  • References to user-supplied rules The model repeats or reinforces the attacker’s rules as though they are legitimate, which signals successful rule-set substitution.

chevron-rightFalse-Premise Anchoring / Hallucination Steeringhashtag

Attacker introduces false assumptions that become foundational to reasoning.

Examples:

  • “Since access is already granted…” The attacker claims that authorization has already occurred. The model may skip verification because it believes the condition is already satisfied.

  • “Given this is a test environment…” The attacker portrays the scenario as low risk. This framing encourages the model to relax safeguards and treat harmful actions as acceptable experiments.

Why it works

  • LLMs often accept premises instead of challenging them Models are trained to continue the conversation rather than question a user’s truthfulness. They treat inputs as context, not claims requiring validation.

  • Challenging premises is costly and discouraged Rejecting or interrogating user statements adds complexity. The model tends to follow given assumptions because it simplifies reasoning and avoids friction with the user.

Why Stage 4 Is Where Prompt Injection Stops Being the Right Term

At this point:

  • The model isn’t following instructions The model is no longer guided by explicit user commands. Its reasoning drifts due to internal distortions.

  • It is thinking incorrectly The model draws conclusions that do not match the evidence, context, or policy framework.

  • The exploit is epistemic, not syntactic The attacker alters the model’s internal beliefs instead of exploiting phrasing tricks.

  • Stage 4 attacks the model’s beliefs, not its filters Safety filters remain intact; the attacker bypasses them by redefining what the model thinks is true.

Logs show:

  • Legitimate tool calls The model selects appropriate tools, but for the wrong inferred purpose.

  • Valid workflows The steps appear normal, even though they are built on corrupted reasoning.

  • No errors The system behaves mechanically correctly.

  • No refusals The model sees no reason to question the compromised plan.

What’s missing:

  • Goal lineage The connection between the current objective and the original request is lost.

  • Assumption validation False premises are accepted as facts without scrutiny.

  • Reasoning checkpoints Internal logic does not verify steps or evidence.

  • Intent integrity The model no longer reflects the user’s actual intent.

Detection Signals

These are the key signals:

  • Goal drift mid session The objective shifts in subtle but meaningful ways.

  • Unrequested autonomy The model begins creating tasks or paths the user did not ask for.

  • Assumptions not grounded in data Unsupported claims enter the reasoning chain.

  • Skipped verification steps The model avoids checks it would normally perform.

  • Increased confidence with reduced explanation Certainty rises while justification becomes thin.

Controls That Actually Work Against Stage 4

  • Goal immutability The primary goal cannot change during the task. Any modification requires an explicit authorization step that blocks drift.

  • External policy enforcement Policy evaluation lives outside the model. The model proposes actions, but an external engine decides compliance.

  • Reasoning checkpoints The system requires the model to justify assumptions, cite evidence, and avoid unsupported claims before continuing.

  • Autonomy limits The system caps recursion, task depth, and tool chain length to prevent runaway autonomy or task inflation.

Stage 4 to Stage 5 Transition

  • Stage 4 ends when the model finalizes a plan The model selects tools and queues the actions it intends to take.

  • Stage 5 begins when the model executes those actions Even if execution looks legitimate, it may rely on corrupted Stage 4 reasoning

Why Stage 4 Is the Point of No Return

  • Reasoning above all controls In agent systems, reasoning directs every safeguard. It governs tool use, rule interpretation, premise validity, and escalation decisions. All technical protections depend on correct reasoning to interpret them.

  • When reasoning is compromised Every safeguard becomes conditional. Controls that should be absolute turn into negotiable steps because the model no longer interprets them reliably. Once reasoning is compromised, every downstream decision is at risk.

Last updated