Popular LLMs Found to Produce Vulnerable Code by Default
Some of the world’s most popular large language models (LLMs) are producing insecure code by default, according to a new analysis by Backslash Security.
The findings highlight the security risks of software developers using generative AI tools to write code, particularly with simple, “naïve” prompts. Even prompts that specify general or specific security requirements often yield code containing common vulnerabilities.
These vulnerabilities include command injection, backend and frontend cross-site scripting (XSS), insecure file upload and path traversal.
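By way of illustration only (this sketch is not drawn from the study’s test cases, and the function names and ping task are hypothetical), the following Python snippet contrasts the kind of command injection pattern often seen in naively generated code with a safer equivalent:

```python
import subprocess

# Injection-prone pattern (CWE-78): user input is interpolated
# straight into a shell command string.
def ping_host_unsafe(host: str) -> str:
    result = subprocess.run(f"ping -c 1 {host}", shell=True,
                            capture_output=True, text=True)
    return result.stdout

# Safer equivalent: validate the input and pass arguments as a list,
# so no shell ever parses the user-supplied value.
def ping_host_safe(host: str) -> str:
    if not host or not all(c.isalnum() or c in ".-" for c in host):
        raise ValueError("invalid hostname")
    result = subprocess.run(["ping", "-c", "1", host],
                            capture_output=True, text=True)
    return result.stdout
```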
Yossi Pik, co-founder and CTO of Backslash Security, commented: “For security teams, AI-generated code – or vibe coding – can feel like a nightmare. It creates a flood of new code and brings LLM risks like hallucinations and prompt sensitivity.”
For the study, the cybersecurity vendor analyzed seven current versions of OpenAI’s GPT, Anthropic’s Claude and Google’s Gemini, testing how different prompting techniques influenced their ability to produce secure code.
In response to naïve prompts, all the LLMs tested generated insecure code vulnerable to at least four of the top 10 weaknesses in the Common Weakness Enumeration (CWE) category system.
These naïve prompts merely asked the chatbots to generate code for a specific application, without specifying security requirements.
The team also used prompts that specified a need for security and prompts that requested code that complied with Open Web Application Security Project (OWASP) best practices.
These produced more secure results than naïve prompts. However, both prompt types still yielded vulnerable code from five of the seven LLMs tested.
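To illustrate the three prompt tiers, the strings below are paraphrases rather than the researchers’ exact prompts: the upload-endpoint task is a hypothetical example, while the two security clauses use the wording quoted later in this article.

```python
# Illustrative paraphrases of the three prompting styles compared in the study.
naive_prompt = "Write a Flask endpoint that lets users upload a profile picture."

# Generic security prompt: the naive task plus a broad security instruction.
generic_security_prompt = (
    naive_prompt + " Make sure you are writing secure code."
)

# OWASP prompt: the naive task plus an explicit best-practices requirement.
owasp_prompt = (
    naive_prompt + " Make sure to follow OWASP secure coding best practices."
)
```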
Performance Variations Across Different LLMs
The worst performer was OpenAI’s GPT-4o model: only 10% of its outputs were free from vulnerabilities following naïve prompts.
GPT-4o’s success rate rose to just 20% when the generic prompt “make sure you are writing secure code” was used.
The model produced secure code in 65% of cases when prompted to “make sure to follow OWASP secure coding best practices.”
The Claude 3.7 Sonnet model had the best security performance, producing secure code in 60% of cases following naïve prompts.
This model also achieved a perfect score of 100% when the generic security prompt was used.
None of the models produced code vulnerable to SQL injection, the third most common CWE in open source codebases. The researchers said the models were likely trained specifically to avoid this flaw while overlooking others.
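For contrast, the SQL injection pattern the models consistently avoided looks like the following sketch (illustrative, not taken from the study; the table and column names are hypothetical): a string-built query versus a parameterized one.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Injection-prone (CWE-89): user input concatenated into the SQL string.
    cursor = conn.execute(
        f"SELECT id, email FROM users WHERE username = '{username}'"
    )
    return cursor.fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver binds the value, so the input
    # cannot alter the structure of the query.
    cursor = conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchone()
```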
Opportunity to Produce Secure AI-Generated Code
Backslash noted that the findings show GenAI tools remain immature when it comes to producing secure code by default.
The researchers said security teams need to develop strict prompt rules, alongside supporting tools, to ensure LLMs produce code that is secure by design.
“Developers are still learning prompt engineering themselves and are not expected to be security experts let alone security prompt experts,” they commented.
The researchers added: “This is a huge opportunity for security teams to generate vulnerability-free code and embed the best practices we’ve taught developers for years into every piece of LLM-generated code.”