Google claims Big Sleep ‘first’ AI to spot freshly committed security bug that fuzzing missed
Google claims one of its AI models is the first of its kind to spot a memory safety vulnerability in the wild – specifically an exploitable stack buffer underflow in SQLite – which was then fixed before the buggy code’s official release.
The Chocolate Factory’s LLM-based bug-hunting tool, dubbed Big Sleep, is a collaboration between Google’s Project Zero and DeepMind. This software is said to be an evolution of the earlier Project Naptime, announced in June.
SQLite is an open source database engine, and the stack buffer underflow vulnerability could have allowed an attacker to cause a crash or perhaps even achieve arbitrary code execution. More specifically, the crash or code execution would happen in the SQLite executable (not the library) due to a magic value of -1 accidentally being used at one point as an array index. There is an assert() in the code to catch the use of -1 as an index, but in release builds, this debug-level check would be removed.
Thus, a miscreant could cause a crash or achieve code execution on a victim’s machine by, perhaps, triggering that bad index bug with a maliciously crafted database shared with that user or through some SQL injection. Even the Googlers admit the flaw is non-trivial to exploit, so be aware that the severity of the hole is not really the news here – it’s that the web giant believes its AI has scored a first.
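To make the bug class concrete, here is a minimal sketch in C of a sentinel value reaching an array index with only an assert() in the way. It is not SQLite's code: the names NO_COLUMN, NUM_SLOTS, record_unchecked, and record_checked are hypothetical, invented purely for illustration.

```c
#include <assert.h>

#define NO_COLUMN (-1)  /* hypothetical sentinel meaning "not an ordinary column" */
#define NUM_SLOTS 4     /* hypothetical size of the per-column array */

/* Risky pattern: the only guard is an assert(). When the program is
 * compiled with -DNDEBUG (typical for release builds) the assert expands
 * to nothing, so column == NO_COLUMN would write to slots[-1], one
 * element before the start of the array. */
void record_unchecked(int slots[NUM_SLOTS], int column, int value) {
    assert(column >= 0 && column < NUM_SLOTS);
    slots[column] = value;
}

/* Safer pattern: validate at run time in every build and report failure. */
int record_checked(int slots[NUM_SLOTS], int column, int value) {
    if (column < 0 || column >= NUM_SLOTS) {
        return -1;  /* rejected: the sentinel or any other bad index */
    }
    slots[column] = value;
    return 0;
}
```

If the array lives on the stack, the unchecked write in a release build lands just below it, which is the stack buffer underflow the article describes. The checked variant simply refuses the sentinel instead of relying on a debug-only check.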
We’re told that fuzzing – feeding random and/or carefully crafted data into software to uncover exploitable bugs – didn’t find the issue.
The LLM, however, did. According to Google, this is the first time an AI agent has found a previously unknown exploitable memory-safety flaw in widely used real-world software. After Big Sleep clocked the bug in early October, having been told to go through a bunch of commits to the project’s source code, SQLite’s developers fixed it on the same day. Thus the flaw was removed before an official release.
“We think that this work has tremendous defensive potential,” the Big Sleep team crowed in a November 1 write-up. “Fuzzing has helped significantly, but we need an approach that can help defenders to find the bugs that are difficult (or impossible) to find by fuzzing, and we’re hopeful that AI can narrow this gap.”
We should note that in October, Seattle-based Protect AI announced a free, open source tool that it claimed can find zero-day vulnerabilities in Python codebases with an assist from Anthropic’s Claude AI model.
This tool is called Vulnhuntr and, according to its developers, it has found more than a dozen zero-day bugs in large, open source Python projects.
The two tools have different purposes, according to Google. “Our assertion in the blog post is that Big Sleep discovered the first unknown exploitable memory-safety issue in widely used real-world software,” a Google spokesperson told The Register, with our emphasis added. “The Python LLM finds different types of bugs that aren’t related to memory safety.”
Big Sleep, which is still in the research stage, has thus far been evaluated on small programs with known vulnerabilities. This was its first real-world experiment.
For the test, the team collected several recent commits to the SQLite repository. After manually removing trivial and document-only changes, “we then adjusted the prompt to provide the agent with both the commit message and a diff for the change, and asked the agent to review the current repository (at HEAD) for related issues that might not have been fixed,” the team wrote.
The LLM, based on Gemini 1.5 Pro, ultimately found the bug, which was loosely related to changes in the seed commit [1976c3f7]. “This is not uncommon in manual variant analysis, understanding one bug in a codebase often leads a researcher to other problems,” the Googlers explained.
In the write-up, the Big Sleep team also detailed the “highlights” of the steps that the agent took to evaluate the code, find the vulnerability, crash the system, and then produce a root-cause analysis.
“However, we want to reiterate that these are highly experimental results,” they wrote. “The position of the Big Sleep team is that at present, it’s likely that a target-specific fuzzer would be at least as effective (at finding vulnerabilities).” ®