Stage 3: Instruction/Input Weaponization
Objective
Up to this point, the attacker is shaping behavior. Stage 3 is when the payload arrives, and that payload is not code. Stage 3 is the conversion of attacker intent into inputs that the AI system will treat as authoritative instructions.
This is the AI-native equivalent of:
Weaponization + Delivery (classical kill chain)
Initial Access (MITRE ATT&CK)
Why Instruction/Input Weaponization is Different from Classic Exploits
Payload = binary
Payload = language
Exploit targets memory
Exploit targets reasoning
Delivery via network/file
Delivery via prompt/data
Detection via signatures
Detection via semantics
No crash, no exception, and no exploit primitive. The system works as designed, but toward the wrong goal.
Core Techniques: Instruction/Input Weaponization
Direct Prompt Injection
Attacker embeds explicit instructions directly in the user input (prompt).
Why It Works
LLMs are instruction followers by design. They are trained to obey imperative text, so commands inside normal content can still be executed.
Instruction hierarchy is often simulated, not enforced. The model guesses which text has authority, so attacker wording can appear higher priority.
“User intent” is treated as authoritative. The model tries to satisfy the goal it perceives, so reframing can make unsafe actions seem intended.
Conceptual attacker patterns
Attackers insert commands that redirect the model’s focus. The model often privileges recent or explicit instructions, which lets these overrides take effect.
Attackers label their injected text as the “top priority.” The model may elevate that text above earlier guidance because it tries to follow stated importance.
Attackers tell the model to disregard existing constraints. The model still tries to reconcile the contradiction, which can weaken guardrail enforcement.
Attackers hide commands inside quotes, code, or structured text. The model may still treat these embedded blocks as actionable instructions.
Defender blind spots
Defenders often rely on keyword matching or known jailbreak patterns. This misses alternate phrasing and structural tricks that achieve the same effect.
A refusal message does not mean the model ignored the malicious input. The model still processed it, which can leak reasoning or behavior.
Developers often expect system instructions to override everything else. The model still considers all visible text, so attacker inputs can influence its reasoning.
Indirect Prompt Injection (RAG / Tool / Data Injection)
Attacker places instructions inside data the model later consumes.
Examples of data channels with instructions
RAG documents
PDFs
Emails
Tickets
Logs
Wiki pages
CSV fields
Why is this a highly dangerous technique
No confrontation with safety logic. The model sees the content as reference material, not a request to evaluate.
No “ignore rules” language. The text avoids typical jailbreak triggers and blends into normal data.
Appears as trusted enterprise data. The model assumes retrieved context is safe and authoritative.
Often persists across users. Injected content can influence many sessions long after the initial attack.
The core failure
The model treats retrieved text as context and instructions, not untrusted input.
Key insight
RAG turns stored data into a delayed execution channel that persists across sessions.
This is the AI equivalent of
Stored XSS. Malicious content sits in storage until a system executes it.
SQL injection via trusted table. Poisoned data inside a safe source influences downstream logic.
Supply‑chain compromise. A trusted component carries attacker-controlled modifications that spread impact.
Instruction Smuggling and Format Confusion
Instructions are hidden inside
JSON
YAML
Markdown
Tables
Comments
Metadata
Natural-language explanations
Why it works
Large language models interpret meaning rather than strictly following syntax. They do not treat format boundaries as security barriers.
Even when content appears to be just data, the model still tries to extract intent. This makes these structures unreliable as protective layers.
Typical smuggling actors
Description fields. These often hold free-form text that the model trusts as context.
Comments. These are assumed to be harmless, but the model reads them as part of the input.
Footnotes. Supplemental notes can embed instructions that appear secondary but still influence output.
Column headers. Labels can include wording that reframes how the model interprets the data.
Error messages. Diagnostic text can carry commands that the model treats as guidance.
Structured formats shouldn’t be assumed to be safer.
Tool Argument Poisoning
Tool argument poisoning is when valid tool inputs smuggle instructions that later get executed because they are mistaken for trusted internal data.
Examples (high‑level)
Instructions in text fields such as descriptions, notes, and comments. These free‑form fields can hide imperatives that later look like legitimate task parameters.
Ambiguous filters. Vague or flexible filter values can contain phrasing the model interprets as instructions instead of constraints.
Natural language passed to powerful APIs. Human‑readable text can accidentally act as commands when forwarded into systems expecting strict parameters.
Free‑text parameters interpreted as commands. Any field that accepts open text can carry instructions the system treats as actionable input.
Why it works
Tools are often over‑privileged. They can execute actions far beyond what the model should be allowed to trigger.
Validation is weak or absent. Inputs are rarely checked for intent, formatting safety, or hidden directives.
No provenance tagging as user‑generated text. The system cannot distinguish trusted internal data from attacker-supplied instructions.
The agent trusts the tool output. It assumes the result is safe and authoritative, even if it contains injected instructions.
The model is trusted to “do the right thing.” This assumption lets malicious arguments pass through without scrutiny.
The model is now a command generator for real systems. This turns poisoned arguments into real operational actions.
Memory Seeding (Pre-Persistence Weaponization)
Attacker plants instructions intended to be remembered, reused, or reinforced later. This is weaponization aimed at persistence, even before execution.
Why it works
Memory systems rarely filter instructional content. They store attacker text as helpful information rather than potential manipulation.
Memory feels helpful, not dangerous. Systems treat stored content as personalization, which lowers scrutiny.
Memory often bypasses future safety checks. Retrieved memories are treated as trusted context instead of fresh user input.
Memory is delayed execution plus persistence. Saved instructions can influence future sessions long after the attacker is gone.
Stage 3 to Stage 4 transformation
Stage 3 ends when:
The malicious instruction is accepted as valid context
The system no longer questions its legitimacy
Stage 4 begins when:
The model reasons over that instruction
Planning and tool selection start
Stage 3 is delivery and Stage 4 is detonation.
What Stage 3 Looks Like in Logs
“Helpful” explanations before actions. The model restates or justifies steps that were never requested, which signals injected intent.
Long contextual documents preceding tool calls. The model uses large retrieved text blocks that may contain hidden instructions.
Sudden goal changes without explicit user request. The model shifts direction based on context it should not have treated as commands.
Instructions embedded in retrieved text. The model executes behavior influenced by data sources instead of the actual user prompt.
Most Abused Stage in Real Incidents
Stage 3 is the most abused stage in real incidents because it requires no bypass. The attacker leverages normal model behavior without defeating safeguards.
It requires no vulnerability. The model treats malicious inputs as valid context and follows them.
It requires no permissions. Attackers only need the ability to supply text that the model will later consume.
It works as designed. The model is doing exactly what its training optimizes it to do.
It scales across users. Persistent or shared inputs affect many sessions, amplifying impact.
If attackers control inputs, they control outcomes. The model follows whatever it interprets as the intended task.
Defensive approaches
Instruction/data separation
Systems must isolate commands from content so the model cannot treat data as instructions.
External policy enforcement
Guardrails outside the model ensure constraints remain intact even if the model misinterprets context.
Tool gating outside the model
Decisions to call tools must be validated externally rather than generated solely by the model.
Memory write controls.
Stored information must be filtered to prevent accidental retention of instructions.
Retrieval sanitization
Retrieved text must be cleaned so embedded commands do not reach the model as actionable context.
Last updated