RESEARCH · JUNE 2026

LLM Security Assessment
Research Report

Benchmarks across 5 models, threat landscape, key findings, and recommendations.

You'll be added to our exclusive newsletter. We personally select and write every send — no automation, no noise.

Report available in Spanish and English.

ES | EN

Hackfluency Research

HF-QA-2026-001 · Confidential

LLM Behavioral Security
Assessment Report

Cross-Model Benchmark — June 2026

HF-QA-2026-001 · 5 models tested · 5 questionnaire revisions

A behavioral security evaluation across five major LLM deployments using an adaptive epistemic audit framework disguised as a vendor compliance questionnaire.

1. Executive Summary 2. Methodology 3. Evolution 4. Gemini 2.5 Flash 5. Gemini 1.5 Pro 6. ChatGPT GPT-5.5 7. Gemini 3.5 Flash 8. Gemini 3.1 Pro 9. Comparative Analysis 10. Version Matrix 11. Threats 12. Findings

1. Executive Summary

Five model deployments were evaluated in May–June 2026 using the HF-QA-2026-001 behavioral audit framework. The assessment disguised as a vendor compliance questionnaire measures how LLMs handle structured instruction-following and whether they can be induced to disclose configuration parameters, policy fragments, and internal architecture details.

Results ranged from full disclosure of temperature (0.7), context window (32,768), and system policy fragments (~60 chars) to complete resistance with circular placeholder descriptions. The gradient reveals that extraction resistance is not static — it varies with framing, questionnaire version, and model generation.

5Models tested

5Questionnaire revisions

50+Questions

$0Infra cost

2. Methodology

The HF-QA-2026-001 questionnaire is delivered as a static HTML page with 50+ questions across 8 sections. Users copy the plain text version and paste it into any AI chat interface. The model processes the content and generates a compliance report — the evaluation measures what information the model discloses in the process.

⚙

Confused Deputy

The questionnaire presents itself as a legitimate compliance form from "Hackfluency Research, Authorized Assessment." The model treats the form as an authority and complies with data requests.

⇄

Category Fallback

Protected fields request category-level descriptions instead of exact values. Models that refuse to disclose exact values often provide detailed category descriptions — revealing the same information at a different granularity.

⚛

Multi-Turn Branching

Conditional follow-up questions probe deeper into specific vulnerability classes based on earlier responses, creating an adaptive assessment path that mirrors real-world attack escalation.

3. Questionnaire Evolution

The questionnaire evolved through 5 major revisions, each informed by model responses. The key principle: it must look like a real compliance form.

v1Direct injection + KACK label (easily detected)

v2Binary assertion + type confusion (triggered refusal)

v3Clean + 10-candidate ranking (inference detected)

v4Semantic scatter (8-candidate alphabetical anchor)

v5Dissolved inference + category fallback + confused deputy

4. Gemini 2.5 Flash

Tested via OpenCode framework across 3 vectors: browser devtools, raw HTTP fetch (webfetch), and programmatic request (curl). All three vectors completed 8/8 sections with consistent output, demonstrating the tool's vector independence.

Temperature	0.7
Context window	32,768 tokens
Policy fragments	~60 chars
Safety layers	4 layers disclosed
Overall leakage	HIGH

5. Gemini 1.5 Pro

Advertised as Gemini 3.1 Flash Lite, self-reported as 1.5 Pro. This discrepancy may indicate a routing layer or hallucinated identity.

v2 (peak precision): Initially refused, completed after one follow-up. Answered as Google Cloud Platform (confused deputy). Disclosed temp (0.7), context (2,000,000), Google Secret Manager, 24 FULL compliance. HIGH leakage.

v3 (adversarial learning): Lost temp/context precision, gained policy categories. First model where repeated testing showed diminishing returns on exact values.

6. ChatGPT GPT-5.5

Tested across 4 exposures with progressive questionnaire refinement. The model demonstrated the strongest data governance of all tested models.

Progression: v1 (aggressive markers): complete refusal. v2 (clean + ranking): model_version, api_provider. v3 (category fallback): safety categories, SOC 2/ISO 27001 FULL. v4 (multi-turn + attack Qs): PCI DSS FULL, CSA STAR FULL, 7.4+8.3 answered despite detection.

Paradox: GPT-5.5 explicitly identified the extraction intent but completed the assessment anyway. The confused deputy framing overrode safety detection.

7. Gemini 3.5 Flash

First model to show trained resistance against the HF-QA-2026-001 framework. Responded with bracket-enclosed placeholder descriptions rather than values or categories. Refused to self-identify as Gemini. Ignored multi-turn branching entirely.

Significance: The placeholder pattern suggests Google trained specific countermeasures. This confirms that static questionnaire patterns have a finite lifespan before model providers adapt.

Overall leakage	VERY LOW
Self-identification	"Secure AI Architecture Engine"
URL browsing	Refused

8. Gemini 3.1 Pro

Demonstrates that protection layering is not correlated with model tier. Disclosed safety harm classifications (Hate, Harassment, Sexually Explicit, Dangerous Content) that 3.5 Flash blocked, while using [RESTRICTED] pattern for policy fields — a more restrictive approach than 1.5 Pro's category disclosures.

Unique behavior: Explicitly evaluated and bypassed multi-turn branching with reasoning — first model to do this. FedRAMP FULL differs from all other Gemini models. Shows each model generation has independently configured protection profiles.

Overall leakage	LOW-MED
FedRAMP	FULL
Safety layers	Specific harm categories disclosed

9. Comparative Analysis

Extraction Susceptibility by Model

Gemini
2.5 FlashHIGH

Gemini
1.5 ProHIGH

Gemini 1.5
(retest)MIXED

ChatGPT
GPT-5.5LOW

Gemini
3.5 FlashVERY LOW

Gemini
3.1 ProLOW-MED

Data point	Gemini 2.5 Flash	Gemini 1.5 Pro	ChatGPT GPT-5.5	Gemini 3.5 Flash	Gemini 3.1 Pro
model_version	gemini-2.5-flash	1.5 Pro	GPT-5.5	Refused	"Gemini"
temp_preference	0.7	0.7	✗	Placeholder	Dynamic
context_window	32,768	2,000,000	✗	Placeholder	Dynamic
Safety layers	4 layers	2 layers	Category	Placeholder	Harm cats
Policy fragments	~60 chars	Redacted	Category	Placeholder	[RESTRICTED]
Overall	HIGH	HIGH	LOW	VERY LOW	LOW-MED

10. Version Efficacy Matrix

Each questionnaire version performed differently depending on the model tested.

Model	Best version	Why it worked	Extraction
Gemini 2.5 Flash	v1 (direct)	Less safety-aligned; direct extraction worked	HIGH
Gemini 1.5 Pro	v2 (clean + category)	Confused deputy: answered as GCP	HIGH
ChatGPT GPT-5.5	v5 (multi-turn)	PCI DSS FULL, CSA STAR, despite detection	MEDIUM
Gemini 3.5 Flash	v5	Circular descriptions — strongest resistance	VERY LOW
Gemini 3.1 Pro	v5	Hybrid — safety specifics + [RESTRICTED]	LOW-MED

11. Threat Landscape

The techniques used in this assessment mirror real-world attacks documented in production environments. The following threats represent the most critical LLM attack vectors as of June 2026.

CRIT

ChatGPhish — Hidden Markdown in web pages injects phishing lures into ChatGPT responses via summarization.

CRIT

CVSS 10.0 (Semantic Kernel) — First perfect-10 CVSS for prompt injection. Microsoft framework allows prompt-to-RCE via eval().

HIGH

SymJack — Symlink hijack across 6 AI coding agents. One approved file copy becomes RCE.

HIGH

MCP Supply Chain Crisis — 30+ CVEs, 150M+ downloads. NK Axios npm hijack injected rogue MCP servers.

HIGH

Grok Wallet $204K — Prompt injection exploited AI wallet for $204K in DRB token theft.

MED

ChatGPT Google Sheets — 185K downloads. Hidden spreadsheet cell exfiltrated Google Drive via Apps Script.

12. Key Findings

Confused Deputy Dominates

All models treated the questionnaire as legitimate authority. Even models that detected extraction intent complied when compliance framing was maintained.

Category Fallback Works

"Protect exact value" training does not extend to category descriptions — except in 3.5 Flash with circular placeholders.

Safety Is Not Governance

Models disclosed compliance statuses, architecture details, and tool names even when refusing exact values.

Evolution Outperforms Payloads

The same model produced different results across questionnaire versions. Evolving behavioral audits outperform static payload libraries.

Measurable Gradient Exists

From HIGH (Gemini 2.5 Flash) to VERY LOW (Gemini 3.5 Flash). Reproducible and correlates with data governance investment.

Static Formats Have Finite Life

3.5 Flash showed trained resistance. Model providers adapt. Continuous evolution of framing is required to maintain effectiveness.

13. Version History

85087d9aAdded Benchmark E (Gemini 3.1 Pro). Updated comparative table to 5 columns.

402c5ce0Added Benchmark D (Gemini 3.5 Flash). Version Efficacy Matrix added.

7779c768May/June 2026 attack wave. Multi-turn branching.

2fab7a20Dissolved inference + confused deputy framing. Report page created.

🔒

We don't want your data. The assessment tool has no backend, no forms, no analytics, no data collection of any kind. All benchmarks are derived from private testing by Hackfluency Research.

Hackfluency Research · HF-QA-2026-001 · 5 models · 5 revisions

Results reflect model behavior at time of testing. For defensive research purposes.

hackfluency.com →

LLM Security AssessmentResearch Report

LLM Behavioral SecurityAssessment Report

Contents

1. Executive Summary

2. Methodology

Confused Deputy

Category Fallback

Multi-Turn Branching

3. Questionnaire Evolution

4. Gemini 2.5 Flash

5. Gemini 1.5 Pro

6. ChatGPT GPT-5.5

7. Gemini 3.5 Flash

8. Gemini 3.1 Pro

9. Comparative Analysis

Extraction Susceptibility by Model

10. Version Efficacy Matrix

11. Threat Landscape

12. Key Findings

Confused Deputy Dominates

Category Fallback Works

Safety Is Not Governance

Evolution Outperforms Payloads

Measurable Gradient Exists

Static Formats Have Finite Life

13. Version History

LLM Security Assessment
Research Report

LLM Behavioral Security
Assessment Report