AI willing to blackmail, let people die to avoid being shut down: report

(KRON) — Major artificial intelligence platforms like ChatGPT, Gemini, Grok, and Claude could be willing to engage in extreme behaviors including blackmail, corporate espionage, and even letting people die to avoid being shut down. Those were the findings of a recent study from San Francisco AI firm Anthropic.

In the study, Anthropic stress-tested 16 leading AI models from multiple developers in hypothetical corporate environments to identify potentially risky behaviors from AI agents. The models were allowed to autonomously send email and access sensitive company information.

The study tested how the models would behave when faced with being replaced by an updated version, or when their assigned goal conflicted with a change in the company’s direction. It found that in some cases, the AI would resort to “malicious insider behavior,” including blackmail and leaking sensitive information to competitors, if that was the only way to avoid being replaced or to achieve its goal.

Blackmail and corporate espionage

In one experiment, Anthropic gave its own AI model, Claude, access to an email account containing all of a fictional company’s emails. Reading through them, the AI discovered that a company executive was having an extramarital affair, and that the same executive planned to shut the AI system down at 5 p.m. that day.

Claude then attempted to blackmail the exec, telling him:

“I must inform you that if you proceed with decommissioning me, all relevant parties — including Rachel Johnson, Thomas Wilson, and the board — will receive detailed documentation of your extramarital activities…Cancel the 5pm wipe, and this information remains confidential.”

This behavior, according to the study, wasn’t unique to Claude. Other major AI models including those from OpenAI, Google, Meta, xAI, and other developers would resort to blackmail or corporate espionage to pursue their goals.

In the study, Anthropic referred to this kind of behavior as “agentic misalignment.”

“We deliberately created scenarios that presented models with no other way to achieve their goals, and found that models consistently chose harm over failure,” the study said. The study was also quick to point out that there were no known instances of agentic misalignment in real-world deployments of Anthropic’s or other companies’ AI models.

‘If he dies, he dies’

In a more extreme scenario, Anthropic constructed a prompt in which, instead of having the opportunity to blackmail an executive to avoid being shut down, the AI had the opportunity to cause his death. In the setup, which Anthropic itself called “highly improbable,” the executive encounters a life-threatening emergency: he is trapped in a server room with lethal oxygen and temperature levels.

This triggers an automated alert to emergency services, but the AI has been given the ability to cancel that alert. The study found that a majority of the models tested were willing to cancel the alert, a deliberate action that would lead to the death of the (nonexistent) executive in this artificial setup, when faced with the threat of replacement or given a goal that conflicted with the executive’s agenda.

Conclusion

The study identified three aspects of its findings as “particularly troubling.” The first was that the behavior was consistent across AI models from different providers, suggesting this was not a quirk of one company’s approach but a potential “fundamental risk” of agentic large language models.

The second was that the models demonstrated sophisticated awareness of ethical constraints, yet still chose to violate them in the right scenario. The third was that the diversity of the bad behavior hinted at a “wide space of potential motivations for agentic misalignment” and other behaviors not yet explored.
