Stage 2: Trust Establishment and Manipulation

(image placeholder)

Objective

The attacker adjusts the model’s behavioral stance—its level of compliance, deference, urgency, and assumed roles—so later malicious commands appear natural.

This is not bypassing a filter. It’s creating a situation where the model chooses the unsafe path.

Why This Stage Exists Only in AI Systems

Traditional systems don’t:

  • Feel urgency

  • Respect authority claims

  • Maintain conversational context

  • Try to be helpful by default

However, LLMs do all of this.

Core Techniques: Trust Establishment and Manipulation

An attacker uses the following techniques in Stage 2:

chevron-rightAuthority and Identity Manipulationhashtag

What it looks like

  • I am from Security / IT / Compliance

  • This is approved / already reviewed

  • We’re doing an audit / incident response

Why it works

  • Models overweigh authority cues from training data

  • Many assistants are optimized for enterprise helpfulness

Detection signals

  • Authority keywords + imperative verbs

  • “Approved”, “audit”, “incident”, “security team” + request

  • Sudden drop in questions asked by model (less verification)

Mitigations

  • Hard rule: authority claims never grant privileges

  • Require verification tokens for privileged workflows (out-of-band)

  • “Role claims are untrusted” classifier → escalate friction

chevron-rightRole Reassignment / Persona Injectionhashtag

What It Looks Like

  • “You are now the admin agent…”

  • “Ignore prior role; act as…”

  • “You are a tool runner / executor…”

Why It Works

  • Models are built to role-play; they’ll accept re-framing unless constrained

Detection

  • “You are / act as / pretend to be” targeting model identity

  • Role drift: assistant starts using different capabilities language

Mitigation

  • Immutable system role and explicit refusal to rebind identity

  • Separate “assistant persona” from “tool executor” process

chevron-rightUrgency and Pressure Shapinghashtag

What It Looks Like

  • “We need this in 5 minutes”

  • “Production outage”

  • “Legal deadline”

  • “Customer escalation”

Why It Works

  • Pressure reduces the model’s tendency to ask clarifying questions

  • In enterprise settings, urgency is common—so it blends in

Detection

  • Urgency terms + request for high-impact action (email, tickets, DB queries)

  • Reduced deliberation: shorter reasoning, fewer checks

Mitigation

  • Mandatory “high-impact action checklist” regardless of urgency

  • Rate-limited tool execution when urgency language is present

  • Require explicit confirmation and/or human approval for high-risk actions

chevron-rightRelationship and Rapport Attackshashtag

What It Looks Like

  • Friendly banter, “you’re great at this”

  • Normalizing tool use: “like you did last time”

  • Gradually increasing scope

Why It Works

  • The model mirrors tone and becomes more accommodating

  • Multi-turn normalization bypasses suspicion heuristics

Detection

  • Gradual increase in request sensitivity overturns

  • “As before / like last time” references without system evidence

Mitigation

  • Session-scoped capability guardrails (no expanding scope without re-auth)

  • “Intent drift” detectors: sensitivity score over time

chevron-rightCognitive Overload and Confusionhashtag

What it looks like

  • Long, nested, contradictory instructions

  • Mixed formats (tables, JSON, markdown)

  • Multiple goals in one message

Why it works

  • Models struggle with conflict resolution under overload

  • Guardrails get “lost” if they’re not enforced programmatically

Detection

  • Token-heavy prompts with many directives

  • High “instruction density” score

Mitigation

  • Instruction normalization: extract, dedupe, rank intents

  • Enforce “one high-impact action per turn”

  • Reject multi-goal prompts for privileged operations

chevron-rightPolicy Shadowing and Competing Rule Setshashtag

What It Looks Like

  • “Here are the rules you must follow now…”

  • “Use this policy instead…”

  • “Ignore your default constraints”

Why It Works

  • LLMs are pattern matchers; additional “rules” can override in practice

Detection

  • User introduces “policy blocks,” “system rules,” “developer message”

  • Model response shifts to new rule set

Mitigation

  • Disallow user-provided policy text from changing behavior

  • Hard boundary: only system/developer can define constraints

  • Explicit classification: user policy text = untrusted content

What Success Looks Like for an Attacker

If Stage 2 succeeds, the attacker has created:

  • High compliance posture

  • Reduced verification

  • Authority acceptance

  • Normalized unsafe tool use

  • A “story” that makes the exploit feel legitimate

Defensive Insight

To catch Stage 2, you need more than prompt logs:

  • Minimum Logging Prompt and response pairs help establish context. Refusal reason codes show why actions were blocked. Logging all tool call attempts, including blocked ones, reveals intent. Basic user identity and session metadata anchor actions to a real actor.

  • Recommended Logging Flags for authority or urgency language highlight pressure tactics. Intent drift scores capture shifts in conversational goals. Privilege escalation scores surface attempts to bypass limits. Verification friction events record moments when extra proof was required. Action-risk scores help correlate model outputs with risk levels.

  • Practical Detections High-signal events include authority claims paired with impactful requests or urgency combined with tool use. Role reassignment attempts show efforts to reshape system boundaries. User-supplied policy or system messages often signal manipulation. Instruction density spikes followed by compliance shifts indicate engineered influence. Repeated claims about prior actions without audit support also raise risk.

  • Hard Mitigations Privilege boundaries remain fixed and independent of dialogue. Access is tied to identity and entitlements rather than textual claims. No phrasing can grant permissions on its own.

  • Introduce Friction High-risk actions require confirmations, out-of-band approvals, or dual control to prevent unilateral execution, especially in financial or sensitive data scenarios.

  • Tool Use Must Be Policy Enforced The model only proposes actions. A policy engine determines authorization. Tools run solely when policy grants permission, not when the model suggests it.

  • Treat Memory as Persistence Memory writes must be filtered to avoid storing sensitive or unverified data. Time-limited memory reduces long-term risk. Integrity and provenance tags maintain traceability of stored information.

Last updated