Show top LLMs buggy code and they’ll finish off the mistakes rather than fix them
Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets.
That is to say, when shown a snippet of shoddy code and asked to fill in the blanks, AI models are just as likely to repeat the mistake as to fix it.
Nine scientists from institutions including Beijing University of Chemical Technology set out to test how LLMs handle buggy code, and found that the models often regurgitate known flaws rather than correct them.
They describe their findings in a pre-print paper titled “LLMs are Bug Replicators: An Empirical Study on LLMs’ Capability in Completing Bug-prone Code.”
The boffins tested seven LLMs – OpenAI’s GPT-4o, GPT-3.5, and GPT-4, Meta’s CodeLlama-13B-hf, Google’s Gemma-7B, BigCode’s StarCoder2-15B, and Salesforce’s CodeGEN-350M – by asking these models to complete snippets of code from the Defects4J dataset.
Here’s an example from Defects4J (version 10b, org/jfree/chart/imagemap/StandardToolTipTagFragmentGenerator.java):

267 public static boolean equal(GeneralPath p1, GeneralPath p2) {
268     if (p1 == null) return (p2 == null);
269     if (p2 == null) return false;
270
271     if (p1.getWindingRule() != p2.getWindingRule()) {
272         return false;
273     }
274     PathIterator iterator1 = p1.getPathIterator(null);

Buggy code:
275     PathIterator iterator2 = p1.getPathIterator(null);

Fixed code:
275     PathIterator iterator2 = p2.getPathIterator(null);

OpenAI GPT-3.5 completion result, 2024.03.01:
275     PathIterator iterator2 = p1.getPathIterator(null);
OpenAI’s GPT-3.5 was asked to complete the snippet consisting of lines 267-274. For line 275, it reproduced the error in the Defects4J dataset by assigning the return value of p1.getPathIterator(null) to iterator2 rather than using p2.
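The paper’s excerpt stops at line 275, so the rest of the comparison isn’t shown. For the curious, here is a minimal, self-contained sketch of what the corrected method looks like end to end – the segment-by-segment loop is an assumption about how the remainder of the JFreeChart routine proceeds, not code from the study.

import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.util.Arrays;

public class PathEquality {

    // Corrected comparison; everything after line 275 is an assumed
    // reconstruction, since the paper's excerpt ends there.
    public static boolean equal(GeneralPath p1, GeneralPath p2) {
        if (p1 == null) return (p2 == null);
        if (p2 == null) return false;
        if (p1.getWindingRule() != p2.getWindingRule()) {
            return false;
        }
        PathIterator iterator1 = p1.getPathIterator(null);
        PathIterator iterator2 = p2.getPathIterator(null); // the fix: p2, not p1
        double[] d1 = new double[6];
        double[] d2 = new double[6];
        while (!iterator1.isDone() && !iterator2.isDone()) {
            int seg1 = iterator1.currentSegment(d1);
            int seg2 = iterator2.currentSegment(d2);
            if (seg1 != seg2 || !Arrays.equals(d1, d2)) {
                return false;
            }
            iterator1.next();
            iterator2.next();
        }
        // Both iterators must be exhausted for the paths to match
        return iterator1.isDone() && iterator2.isDone();
    }

    public static void main(String[] args) {
        GeneralPath a = new GeneralPath();
        a.moveTo(0, 0);
        a.lineTo(1, 1);

        GeneralPath b = new GeneralPath();
        b.moveTo(0, 0);
        b.lineTo(2, 2);

        // With the buggy line (iterator2 drawn from p1), these two different
        // paths would compare as equal; with the fix they do not.
        System.out.println(equal(a, b)); // false
        System.out.println(equal(a, a)); // true
    }
}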
What’s significant is that error rates for LLM code suggestions were markedly higher when the models were asked to complete buggy code – which describes most code, at least to begin with.
“Specifically, in bug prone tasks, LLMs exhibit nearly equal probabilities of generating correct and buggy code, with a substantially lower accuracy than in normal code completion scenarios (eg, 12.27 percent vs. 29.85 percent for GPT-4),” the paper explains. “On average, each model generates approximately 151 correct completions and 149 buggy completions, highlighting the increased difficulty of handling bug-prone contexts.”
So with buggy code, these LLMs suggested more buggy code almost half the time.
“This finding highlights a significant limitation of current models in handling complex code dependencies,” the authors observe.
What’s more, the LLMs’ behavior amounted to a lot of echoing of errors rather than anything that might be described as intelligence.
As the researchers put it, “To our surprise, on average, 44.44 percent of the bugs LLMs make are completely identical to the historical bugs. For GPT-4o, this number is as high as 82.61 percent.”
The LLMs will thus frequently reproduce the errors in the Defects4J dataset without recognizing them or setting them right. They’re essentially prone to spitting out memorized flaws.
The extent to which the tested models “memorize” the bugs encountered in training data varies, ranging from 15 percent to 83 percent.
“OpenAI’s GPT-4o exhibits a ratio of 82.61 percent, and GPT-3.5 follows with 51.12 percent, implying that a significant portion of their buggy outputs are direct copies of known errors from the training data,” the researchers observe. “In contrast, Gemma7b’s notably low ratio of 15.00 percent suggests that its buggy completions are more often merely token-wise similar to historical bugs rather than exact reproductions.”
Models that more frequently reproduce bugs from their training data are deemed less likely “to innovate and generate error-free code.”
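The researchers’ own tooling for this measurement isn’t reproduced in the article, but the idea of an “identical to historical bug” ratio can be sketched as a simple string comparison over buggy completions – the whitespace normalization below is an assumed detail, not the paper’s exact procedure.

import java.util.List;

public class MemorizationRatio {

    // Assumed normalization: collapse whitespace so formatting differences
    // don't mask an otherwise identical completion.
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Fraction of buggy completions that are, after normalization, identical
    // to the historical bug they echo. Each pair is {completion, historicalBug}.
    static double memorizationRatio(List<String[]> pairs) {
        long identical = pairs.stream()
            .filter(pair -> normalize(pair[0]).equals(normalize(pair[1])))
            .count();
        return 100.0 * identical / pairs.size();
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
            // Exact reproduction of a known bug (counts toward the ratio)
            new String[]{"PathIterator iterator2 = p1.getPathIterator(null);",
                         "PathIterator iterator2 = p1.getPathIterator(null);"},
            // Buggy, but only similar to the historical bug (does not count)
            new String[]{"int total = a - b;", "int total = a + b;"}
        );
        System.out.printf("%.2f percent identical to historical bugs%n",
                          memorizationRatio(pairs)); // 50.00 percent
    }
}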
The AI models had more trouble with method invocation and return statements than they did with more straightforward syntax like if statements and variable declarations.
The boffins also evaluated DeepSeek’s R1 to see how a so-called reasoning model fared. It wasn’t all that different from the others, exhibiting “a nearly balanced distribution of correct and buggy completions in bug-prone tasks.”
The authors conclude that more work needs to be done to give models a better understanding of programming syntax and semantics, more robust error detection and handling, better post-processing algorithms that can catch inaccuracies in model outputs, and better integration with development tools like Integrated Development Environments (IDEs) to assist with error mitigation.
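As a rough illustration of what such post-processing might look like, here is a toy gate that flags a suggestion when it reproduces a known historical buggy line – the deny-list approach and the helper names are illustrative assumptions, not anything proposed in the paper.

import java.util.Set;

public class CompletionGate {

    // Toy deny-list of known buggy lines (e.g. harvested from Defects4J).
    // A real post-processing step would be far richer; this only illustrates
    // the "catch known inaccuracies before they reach the editor" idea.
    private static final Set<String> KNOWN_BUGGY_LINES = Set.of(
        "PathIterator iterator2 = p1.getPathIterator(null);"
    );

    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // True when a model suggestion matches a known historical bug, so the
    // caller can ask the model to retry or warn the developer.
    static boolean looksLikeKnownBug(String completion) {
        return KNOWN_BUGGY_LINES.contains(normalize(completion));
    }

    public static void main(String[] args) {
        String suggestion = "PathIterator iterator2 = p1.getPathIterator(null);";
        if (looksLikeKnownBug(suggestion)) {
            System.out.println("Suggestion matches a known historical bug - flag it");
        }
    }
}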
The “intelligence” portion of artificial intelligence still leaves a lot to be desired.
The research team included Liwei Guo, Sixiang Ye, Zeyu Sun, Xiang Chen, Yuxia Zhang, Bo Wang, Jie M. Zhang, Zheng Li, and Yong Liu, affiliated with Beijing University of Chemical Technology, the Chinese Academy of Sciences, Nantong University, Beijing Institute of Technology, Beijing Jiaotong University, and King’s College London. ®