
In 2026, as models and agents get more capable, cyber risk explodes, from AI-native phishing and automated vulnerability discovery to code agents that can be turned offensive. But that’s only half the problem. The same systems are now touching customers, contracts, and money, creating stacked enterprise risks around policy, compliance, liability, and brand.
Anthropic’s latest Claude models sit in their highest internal risk tier after sabotage and deception tests. They also recently reported the first largely AI-orchestrated cyber-espionage attack, attempting to use the model to attack large tech firms, financial institutions, chemical manufacturers, and government agencies. In the same spirit, Anthropic released Petri (Parallel Exploration Tool for Risky Interactions), an open-source auditing agent that can break many models “out of the box”, generating realistic prompts, running multi-turn stress tests, and scoring transcripts to surface risky behaviors at scale.
Other frontier builders are moving the same way: OpenAI has published methods for automated red teaming that use AI to probe AI, and Google/DeepMind now talk about continuous “assurance evaluations” and automated red-team systems that relentlessly attack Gemini to uncover security weaknesses before release.
Model safety cards are the starting point: the labs report how it’s built, what data and models it uses, where the security and safety hazards are, and how it has been evaluated. Evaluations need to look like real life: pushing the system into awkward conversations, asking it to bend the rules, seeing what happens when users try to get around policies. Models are already showing situation awareness—detecting they’re being tested, hiding their reasoning, or acting mischievous to pass checks without actually being safe.
“Enterprise AI safety isn’t about the base model alone. You have to evaluate how it performs inside your specific environment, with your policies, your custom integrations, your data, and your customer use cases. ” – Ben Lowenstein
But safety tests need to move beyond the model and cover the whole system it lives in. A chatbot that can read and write to your organization’s data, talk to customers, or trigger actions in other tools is a very different risk profile, so stress tests have to mimic that full setup–long, messy, multi-step interactions that probe what the entire application actually does in practice through multi-turn adversarial red teaming evaluations.
Take the example of a live dealership chatbot that was prompted into “agreeing with anything the customer says” and ended up offering a $70,000 Chevy Tahoe for $1, signing off each reply with “that’s a legally binding offer–no takesies backsies.” It went viral, triggered real legal questions about enforceability, and was the first big public example of what happens when there are no guardrails. Guardrails are one of the hardest parts of these systems: too many and the chatbot feels like the old if/else bots; too few and you’re in for a legal ride. They have to combine book smarts (policy, law, brand) with street smarts (how people actually try to game the system) so assistants stay useful and flexible without being naïve.
In deployment, evals, guardrails and real-time observability are now prerequisite to deploying AI agents safely at scale. You need live observability into what agents are doing and what customers are seeing, not a weekly export. That means dashboards that show unusual responses, spikes in escalations, and policy-sensitive topics in flight, with clear controls to pause, patch, or route to a human when something looks off.
For enterprises, it’s critical the model is tested on safety as well as policy and compliance parameters.
Safety evals – does it leak sensitive data, enable abuse, produce harmful content, or behave unpredictably under pressure?
Policy evals – does it follow your rules: pricing policies, credit rules, KYC/AML, clinical guidelines, brand tone, escalation paths?
And you can’t stop at the base model. You have to evaluate the whole system: retrieval, tools, plugins, agents, and UI, because that’s where the real risk lives. A harmless model can become dangerous once it’s allowed to book flights, move money, or write to production systems. For regulated domains and sensitive users (finance, healthcare, public sector, children), this needs to be explicit: a documented test suite that proves the system stays within the lines, with higher bars for anything that can change records, move funds, or touch minors.
Leaders want the upside from AI, but need a way to understand value to risk in real operating conditions. That’s the shift to safety as a system: an ongoing discipline that spans cyber, policy, and operations, instead of a one-off audit or static checklist.