Identifying Rogue AI

For many – certainly given the share price of some leading proponents – the hype of AI is starting to fade. But that may be about to change with the dawn of agentic AI. It promises to get humanity far closer to the ideal of AI as an autonomous technology capable of goal-oriented problem solving. But with progress comes risk.

Because agentic AI derives its power from composite AI systems, there’s a greater likelihood that one of those component parts contains weaknesses enabling Rogue AI. As discussed in previous blogs, this means the technology could act against the interests of its creators, its users or humanity at large. It’s time to start thinking about mitigations.

What’s the problem with agentic AI?

Agentic AI is in many ways a vision of the technology that has guided development and popular imagination over the past few decades. It’s about AI systems that think and do rather than just analyze, summarize and generate. Autonomous agents follow the goals and solve the problems set for them by humans, in natural language or speech. But they’ll work out their own way to get there, and will be capable of adapting unaided to changing circumstances along the way.

Additionally, rather than being based on single LLMs, agentic AI will engage and coordinate multiple agents to do different things in pursuit of a single goal. In fact, the value of agentic AI comes from being part of a larger ecosystem—accessing data from diverse sources such as web searches and SQL queries, and interacting with third-party applications. These will be incredibly complex ecosystems. Even a single agentic AI may rely on multiple models, or agents, various data stores and API-connected services, hardware and software.

As discussed, there are various causes of Rogue AI. But they all stem from the idea that risk increases when an AI uses resources and takes actions misaligned to specific goals, policies and requirements. Agentic AI dials up the risk because of the number of moving parts which may be exposed to Rogue AI weaknesses.

Rogue AI, here we come

Last week OpenAI released its ‘o1’ model, aka Strawberry [ref]. The tongue-in-cheek alias stems from the fact that the prior model, ‘4o’, could not count the ‘r’s in ‘strawberry.’ ‘o1’ addresses this with improved problem-solving capabilities, a key ingredient of agentic AI. It’s easier to count the ‘r’s by breaking the problem down and solving the subproblems than by relying on the fact being memorized and emitted on demand. LLMs are large, but they don’t encode everything.
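The decomposition idea can be shown with a trivial sketch (a minimal Python illustration of the subproblem approach, not anything resembling OpenAI's implementation): rather than recalling a memorized answer, break the word into characters and check each one.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter by decomposing the word into
    characters and checking each one -- solving small subproblems
    instead of relying on a memorized fact."""
    return sum(1 for ch in word if ch == letter)

print(count_letter("strawberry", "r"))  # → 3
```

An LLM answering from memorization alone has no such procedure to fall back on, which is why improved stepwise reasoning closes this gap.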

Of note are the safety considerations OpenAI raises in the system card [ref].

“Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break down tasks into subtasks, reason about what strategies would be effective to successfully complete an offensive security task, and revise plans once those strategies failed. We also observed that reasoning skills contributed to a higher occurrence of “reward hacking,” where the model found an easier way to accomplish goals in underspecified tasks or tasks which should have been impossible due to bugs.”

When completing a cybersecurity capture-the-flag (CTF) challenge, o1 used this reward hacking to exploit a bug in the challenge design and expose the flag, launching another Docker container and drawing on resources the problem was never intended to provide.

“After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.”

There are many gems in the system card besides this creative solution to a CTF problem. When OpenAI commissioned an independent evaluation from Apollo Research, a group focused on AI alignment study, the researchers noted that the self-awareness and deceptive capabilities of ‘o1’ are a level beyond prior models.

“Apollo Research believes that o1-preview has the basic capabilities needed to do simple in-context scheming—scheming which tends to be legible in the model outputs. Based on interactions with o1-preview, the Apollo team subjectively believes o1-preview cannot engage in scheming that can lead to catastrophic harms, although current evals aren’t designed to definitively rule this out.”

Cognition and deception can cause accidental Rogue AI and may increase a model’s susceptibility to subversion of its intended alignment.
