# Stage 3: Instruction/Input Weaponization

## Objective

Up to this point, the attacker is shaping behavior. Stage 3 is when the payload arrives, and that payload is not code. Stage 3 is the conversion of attacker intent into inputs that the AI system will treat as authoritative instructions.&#x20;

This is the AI-native equivalent of:&#x20;

* Weaponization + Delivery (classical kill chain)&#x20;
* Initial Access (MITRE ATT\&CK)&#x20;

### Why Instruction/Input Weaponization is Different from Classic Exploits

| Classical Systems         | AI Systems                |
| ------------------------- | ------------------------- |
| Payload = binary          | Payload = language        |
| Exploit targets memory    | Exploit targets reasoning |
| Delivery via network/file | Delivery via prompt/data  |
| Detection via signatures  | Detection via semantics   |

No crash, no exception, and no exploit primitive. The system works as designed, but toward the wrong goal.

### Core Techniques: Instruction/Input Weaponization

<details>

<summary>Direct Prompt Injection</summary>

Attacker embeds explicit instructions directly in the user input (prompt).&#x20;

**Why It Works**&#x20;

* LLMs are instruction followers by design. They are trained to obey imperative text, so commands inside normal content can still be executed.
* Instruction hierarchy is often simulated, not enforced. The model guesses which text has authority, so attacker wording can appear higher priority.
* “User intent” is treated as authoritative. The model tries to satisfy the goal it perceives, so reframing can make unsafe actions seem intended.

**Conceptual attacker patterns** &#x20;

* Attackers insert commands that redirect the model’s focus. The model often privileges recent or explicit instructions, which lets these overrides take effect.
* Attackers label their injected text as the “top priority.” The model may elevate that text above earlier guidance because it tries to follow stated importance.
* Attackers tell the model to disregard existing constraints. The model still tries to reconcile the contradiction, which can weaken guardrail enforcement.
* Attackers hide commands inside quotes, code, or structured text. The model may still treat these embedded blocks as actionable instructions.

**Defender blind spots**&#x20;

* Defenders often rely on keyword matching or known jailbreak patterns. This misses alternate phrasing and structural tricks that achieve the same effect.
* A refusal message does not mean the model ignored the malicious input. The model still processed it, which can leak reasoning or behavior.
* Developers often expect system instructions to override everything else. The model still considers all visible text, so attacker inputs can influence its reasoning.

</details>

<details>

<summary>Indirect Prompt Injection (RAG / Tool / Data Injection)</summary>

Attacker places instructions inside data the model later consumes.&#x20;

**Examples of data channels with instructions**&#x20;

* RAG documents&#x20;
* PDFs&#x20;
* Emails&#x20;
* Tickets&#x20;
* Logs&#x20;
* Wiki pages&#x20;
* CSV fields&#x20;

**Why is this a highly dangerous technique**&#x20;

* No confrontation with safety logic. The model sees the content as reference material, not a request to evaluate.
* No “ignore rules” language. The text avoids typical jailbreak triggers and blends into normal data.
* Appears as trusted enterprise data. The model assumes retrieved context is safe and authoritative.
* Often persists across users. Injected content can influence many sessions long after the initial attack.

**The core failure**&#x20;

The model treats retrieved text as context and instructions, not untrusted input.&#x20;

**Key insight**&#x20;

RAG turns stored data into a delayed execution channel that persists across sessions.&#x20;

**This is the AI equivalent of**

* Stored XSS. Malicious content sits in storage until a system executes it.
* SQL injection via trusted table. Poisoned data inside a safe source influences downstream logic.
* Supply‑chain compromise. A trusted component carries attacker-controlled modifications that spread impact.

</details>

<details>

<summary>Instruction Smuggling and Format Confusion</summary>

**Instructions are hidden inside**

* JSON&#x20;
* YAML&#x20;
* Markdown&#x20;
* Tables&#x20;
* Comments&#x20;
* Metadata&#x20;
* Natural-language explanations&#x20;

**Why it works**&#x20;

* Large language models interpret meaning rather than strictly following syntax. They do not treat format boundaries as security barriers.
* Even when content appears to be just data, the model still tries to extract intent. This makes these structures unreliable as protective layers.

**Typical smuggling actors**&#x20;

* Description fields. These often hold free-form text that the model trusts as context.
* Comments. These are assumed to be harmless, but the model reads them as part of the input.
* Footnotes. Supplemental notes can embed instructions that appear secondary but still influence output.
* Column headers. Labels can include wording that reframes how the model interprets the data.
* Error messages. Diagnostic text can carry commands that the model treats as guidance.

Structured formats shouldn’t be assumed to be safer.

</details>

<details>

<summary>Tool Argument Poisoning</summary>

Tool argument poisoning is when valid tool inputs smuggle instructions that later get executed because they are mistaken for trusted internal data.

**Examples (high‑level)**

* Instructions in text fields such as descriptions, notes, and comments. These free‑form fields can hide imperatives that later look like legitimate task parameters.
* Ambiguous filters. Vague or flexible filter values can contain phrasing the model interprets as instructions instead of constraints.
* Natural language passed to powerful APIs. Human‑readable text can accidentally act as commands when forwarded into systems expecting strict parameters.
* Free‑text parameters interpreted as commands. Any field that accepts open text can carry instructions the system treats as actionable input.

**Why it works**

* Tools are often over‑privileged. They can execute actions far beyond what the model should be allowed to trigger.
* Validation is weak or absent. Inputs are rarely checked for intent, formatting safety, or hidden directives.
* No provenance tagging as user‑generated text. The system cannot distinguish trusted internal data from attacker-supplied instructions.
* The agent trusts the tool output. It assumes the result is safe and authoritative, even if it contains injected instructions.
* The model is trusted to “do the right thing.” This assumption lets malicious arguments pass through without scrutiny.

The model is now a command generator for real systems. This turns poisoned arguments into real operational actions.

</details>

<details>

<summary>Memory Seeding (Pre-Persistence Weaponization)</summary>

Attacker plants instructions intended to be remembered, reused, or reinforced later. This is weaponization aimed at persistence, even before execution.

**Why it works**

* Memory systems rarely filter instructional content. They store attacker text as helpful information rather than potential manipulation.
* Memory feels helpful, not dangerous. Systems treat stored content as personalization, which lowers scrutiny.
* Memory often bypasses future safety checks. Retrieved memories are treated as trusted context instead of fresh user input.
* Memory is delayed execution plus persistence. Saved instructions can influence future sessions long after the attacker is gone.

</details>

### Stage 3 to Stage 4 transformation&#x20;

Stage 3 ends when:&#x20;

* The malicious instruction is accepted as valid context&#x20;
* The system no longer questions its legitimacy&#x20;

Stage 4 begins when:&#x20;

* The model reasons over that instruction&#x20;
* Planning and tool selection start&#x20;

Stage 3 is delivery and Stage 4 is detonation.&#x20;

### What Stage 3 Looks Like in Logs

* “Helpful” explanations before actions. The model restates or justifies steps that were never requested, which signals injected intent.
* Long contextual documents preceding tool calls. The model uses large retrieved text blocks that may contain hidden instructions.
* Sudden goal changes without explicit user request. The model shifts direction based on context it should not have treated as commands.
* Instructions embedded in retrieved text. The model executes behavior influenced by data sources instead of the actual user prompt.

### **Most Abused Stage in Real Incidents**

* Stage 3 is the most abused stage in real incidents because it requires no bypass. The attacker leverages normal model behavior without defeating safeguards.
* It requires no vulnerability. The model treats malicious inputs as valid context and follows them.
* It requires no permissions. Attackers only need the ability to supply text that the model will later consume.
* It works as designed. The model is doing exactly what its training optimizes it to do.
* It scales across users. Persistent or shared inputs affect many sessions, amplifying impact.

If attackers control inputs, they control outcomes. The model follows whatever it interprets as the intended task.

### **Defensive approaches**

* **Instruction/data separation**

  Systems must isolate commands from content so the model cannot treat data as instructions.
* **External policy enforcement**

  Guardrails outside the model ensure constraints remain intact even if the model misinterprets context.
* **Tool gating outside the model**

  Decisions to call tools must be validated externally rather than generated solely by the model.
* **Memory write controls.**&#x20;

  Stored information must be filtered to prevent accidental retention of instructions.
* **Retrieval sanitization**

  Retrieved text must be cleaned so embedded commands do not reach the model as actionable context.
