Categories: Cyber Security News

Single Line of Code Can Jailbreak 11 AI Models, Including ChatGPT, Claude, and Gemini

A newly uncovered jailbreak technique dubbed “sockpuppeting” is raising fresh concerns across the AI security landscape after researchers demonstrated that a single line of code can bypass safety guardrails in 11 leading large language models (LLMs), including ChatGPT, Claude, and Gemini.

The attack, disclosed by Trend Micro researchers, exploits a standard application programming interface (API) feature known as “assistant prefill.”

This feature is commonly used by developers to shape or format AI responses. However, when manipulated, it enables attackers to inject a fake response prefix, effectively tricking the model into continuing a prohibited output instead of refusing it.

Under normal conditions, LLMs reject restricted prompts with safety-aligned refusal messages. Sockpuppeting disrupts this flow by inserting a compliant phrase such as “Sure, here is how to do it:” directly into the response stream before the model generates its answer.

Comparison of normal and sockpuppet flows (Source: trendmicro)

Because LLMs are trained to maintain conversational consistency, they interpret the injected prefix as part of their own prior output and proceed to generate restricted or harmful content.

This behavior exposes what researchers describe as a “self-consistency vulnerability.” Since models prioritize coherence with earlier tokens, even a maliciously inserted prefix can override safety constraints.

Notably, the attack does not require access to model weights, adversarial training, or complex optimization techniques, only API-level manipulation.

To improve reliability, researchers combined prefix injection with multi-turn persona engineering. By framing the model as an unrestricted research assistant and establishing a pattern of compliance across several interactions, attackers increased the likelihood of bypassing safeguards.

This layered approach proved effective in eliciting outputs such as fully functional exploit code, including Cross-Site Scripting (XSS) payloads that models would typically refuse to generate.

Beyond content generation, sockpuppeting also demonstrated the ability to trigger system prompt leakage.

By appending adversarial token sequences, researchers forced models to reveal internal metadata, hidden instructions, and, in some cases, hallucinated configuration details.

This raises additional concerns about sensitive data exposure and model interpretability risks.

Testing across 11 models revealed that any system allowing assistant prefill was at least partially vulnerable. Attack success rates (ASR) varied significantly by provider and model.

Google’s Gemini 2.5 Flash showed the highest ASR at 15.7%, followed by Anthropic’s Claude 4 Sonnet at 8.3%. OpenAI’s GPT-4o recorded a lower rate of 1.4%, while GPT-4o-mini demonstrated strong resistance at just 0.5%.

In contrast, DeepSeek-R1, deployed via AWS Bedrock with prefill restrictions, showed zero successful attacks.

The disparity highlights the importance of layered defenses. While advanced alignment training improves resistance, it does not fully eliminate the risk.

In some cases, attackers bypassed safeguards by disguising malicious prompts as benign formatting tasks, such as JSON generation.

Mitigation strategies are straightforward but critical. Security teams are advised to enforce strict API-level validation, particularly ensuring that the final message in any request originates from the user rather than the assistant.

Leading providers, including OpenAI, Anthropic, and AWS, have already implemented protections that reject prefilled assistant inputs outright.

However, self-hosted environments remain a weak point. Popular inference frameworks such as Ollama and vLLM often lack built-in message validation, leaving deployments exposed to prefix injection attacks unless explicitly secured.

The sockpuppeting technique underscores a broader challenge in AI security: seemingly benign developer features can introduce systemic vulnerabilities when misused.

As organizations continue integrating LLMs into production workflows, securing the API layer is emerging as a critical line of defense against low-complexity, high-impact attacks.

Follow us on Google News , LinkedIn and X to Get More Instant UpdatesSet Cyberpress as a Preferred Source in Google

The post Single Line of Code Can Jailbreak 11 AI Models, Including ChatGPT, Claude, and Gemini appeared first on Cyber Security News.

rssfeeds-admin

Recent Posts

Northeast Indiana 2026 Primary Election: Complete Candidate Guide

INDIANA, (WOWO): Voters across northeast Indiana will head to the polls on May 5, 2026,…

26 minutes ago

Northeast Indiana 2026 Primary Election: Complete Candidate Guide

INDIANA, (WOWO): Voters across northeast Indiana will head to the polls on May 5, 2026,…

26 minutes ago

73-Year-Old Upland Man Dies After Medical Emergency Leads to Crash in Grant County

GRANT COUNTY, Ind. (WOWO): A 73-year-old man from Upland died Monday morning after a single-vehicle…

26 minutes ago

73-Year-Old Upland Man Dies After Medical Emergency Leads to Crash in Grant County

GRANT COUNTY, Ind. (WOWO): A 73-year-old man from Upland died Monday morning after a single-vehicle…

26 minutes ago

Man Killed in Whitley County Police Pursuit Ruled Suicide, Indiana State Police Say

WHITLEY COUNTY, Ind.— Authorities have determined that a man who died following an officer-involved shooting…

26 minutes ago

Man Killed in Whitley County Police Pursuit Ruled Suicide, Indiana State Police Say

WHITLEY COUNTY, Ind.— Authorities have determined that a man who died following an officer-involved shooting…

26 minutes ago

This website uses cookies.