The Road to Agentic AI: Exposed Foundations
Report highlights:
- Retrieval augmented generation (RAG) enables enterprises to build customized, efficient, and cost-effective applications based on private data. However, research reveals significant security risks, such as exposed vector stores and LLM-hosting platforms, which can lead to data leaks, unauthorized access, and potential system manipulation if not properly secured.
- Security issues such as data validation bugs and denial-of-service attacks are prevalent across RAG components. This is compounded by their rapid development cycle, which makes tracking and addressing vulnerabilities challenging.
- Research identified 80 exposed llama.cpp servers, 57 of which lacked authentication. Exposed servers were concentrated in the United States, followed by China, Germany, and France, reflecting global adoption with varying levels of security practices.
- Beyond authentication, enterprises must implement TLS encryption and enforce zero-trust networking to ensure that generative AI systems and their components are shielded from unauthorized access and manipulation.
“Move fast and break things” seems to be the current motto in the field of AI. Ever since the introduction of ChatGPT in 2022, everyone has been jumping on the bandwagon. In some fields, people have been happy to simply use OpenAI’s offerings, but many enterprises have specialized needs. As Nick Turley, OpenAI’s head of product, recently said, LLMs are a “calculator for words,” and this new technology has opened up many possibilities for enterprises. However, some engineering is needed to use this “word calculator” effectively, and while we wait for proper agentic AI systems, the current technology of choice is retrieval augmented generation (RAG).
RAG needs a few ingredients to run. It needs a database of text chunks and a way of retrieving them. We usually use a vector store for this, which saves each text chunk alongside a numerical representation (an embedding) that helps us find the most relevant chunks. With these and an appropriate prompt, we can often answer questions or compose new texts that are based on private data sources and are relevant to our needs. Indeed, RAG is so effective that the most powerful large language models (LLMs) are not always needed. To save costs and improve response time, we can use our own servers to host smaller, lighter models.
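To make this concrete, here is a minimal sketch of the prompt-assembly step. The chunks and the question are placeholders; in a real application, the chunks would come back from the vector store:

```python
# Minimal sketch of the RAG prompt pattern: retrieved text chunks are stitched
# into the prompt so that the model answers from private data.
retrieved_chunks = [  # placeholders; a real app pulls these from a vector store
    "Invoices are archived for seven years.",
    "Only the finance team may delete archived invoices.",
]
question = "How long do we keep invoices?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n"
    + "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    + f"\n\nQuestion: {question}\nAnswer:"
)

# 'prompt' is then sent to the LLM of choice, hosted or local.
print(prompt)
```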
As an analogy, the vector store is like a very helpful librarian who not only chooses the relevant books but also highlights the relevant passages. The LLM is then the researcher who takes these highlighted texts and uses them to write the paper or answer the question. Together, they form a RAG application.
Vector stores are not completely new, but have been seeing a renaissance over the last two years. While there are many hosted solutions like Pinecone, there are also self-hosted solutions like ChromaDB or Weaviate (https://weaviate.io). They allow a developer to find text chunks similar to the input text, such as a question that needs to be answered.
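As an illustration, the following sketch indexes two chunks in an in-memory ChromaDB instance and retrieves the one most similar to a question. The collection name and documents are invented for the example:

```python
import chromadb

# In-memory client; production deployments typically run a persistent server.
client = chromadb.Client()
collection = client.create_collection(name="handbook")  # illustrative name

# Index a few text chunks; ChromaDB embeds them with its default model.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Expense reports are due on the last Friday of each month.",
        "Remote employees may claim a home-office stipend once per year.",
    ],
)

# Retrieve the chunk most similar to the question.
results = collection.query(
    query_texts=["When are expense reports due?"],
    n_results=1,
)
print(results["documents"][0][0])
```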
Hosting one’s own LLM does require a decent amount of memory and a good GPU, but nothing that a cloud provider cannot supply. For those with a good laptop or PC, LMStudio is a popular pick. For enterprise use, llama.cpp and Ollama are often the first choice. All of these have undergone exceptionally rapid development, so it should be no surprise that some bugs have crept in.
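For example, Ollama exposes a simple REST API on its default port. The sketch below assumes a local instance with the llama3 model already pulled:

```python
import requests

# Query a self-hosted model through Ollama's REST API. This assumes Ollama is
# running locally on its default port and the "llama3" model has been pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize retrieval augmented generation in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=60,
)
print(response.json()["response"])
```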
Some of these bugs in RAG components are typical data validation bugs, such as CVE-2024-37032 and CVE-2024-39720. Others lead to denial of service, like CVE-2024-39720 and CVE-2024-39721, or leak the existence of files, like CVE-2024-39719 and CVE-2024-39722. The list goes on. Less is known about llama.cpp, but CVE-2024-42479 was found this year, while CVE-2024-34359 affects llama-cpp-python, the Python bindings for llama.cpp. Perhaps less is known about llama.cpp because of its blistering release cycle: since its inception in March 2023, there have been over 2,500 releases, or around four a day. With a moving target like that, it is hard to track its vulnerabilities.
In contrast, Ollama maintains a more leisurely release cycle of only 96 releases since July 2023, about one a week. By comparison, Linux is released every few months and Windows sees new “Moments” every quarter.
The vector store ChromaDB has been around since October 2022 and releases roughly every two weeks. Interestingly, there are no known CVEs directly associated with it. Weaviate, another vector store, has also been found to have vulnerabilities (CVE-2023-38976, and CVE-2024-45846 when used with MindsDB). It has been around since 2019, making it a veritable grandfather of this technology stack, yet it still manages a weekly release cycle. None of these release cycles could be called stable, but the pace does mean that bugs get patched quickly once found, limiting their exposure time.
LLMs on their own are not likely to fulfill all needs and are only improving incrementally as they run out of public data to train on. The future is likely to be agentic AI, which combines LLMs, memory, tools, and workflows into more advanced AI-based systems, as championed by Andrew Ng. Essentially, this is a new software development stack, and LLMs and vector stores will continue to play a major role in it.
But along this path, enterprises are going to get hurt if they do not pay attention to the security of their systems.
We were worried that in their haste, many developers would expose these systems to the internet, so we searched for instances of some of these RAG components on the internet in November 2024. We focused on the four top components used in RAG systems: llama.cpp and Ollama, which host LLMs, and ChromaDB and Weaviate, which are vector stores.
llama.cpp exposed
llama.cpp is used to host a single LLM and exposes a REST service; that is, clients talk to the server through HTTP requests, typically POSTs. In our testing, there were some fluctuations in the numbers we saw. On the last count, however, we saw 80 exposed servers, 57 of which did not appear to have any form of authentication. There is a good chance that these numbers are low, and that more servers are better hidden but equally open.
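To illustrate what a missing form of authentication means in practice, consider the sketch below, which uses a placeholder address. llama.cpp's server supports an --api-key option, but it is not enabled by default, so anyone who can reach an unprotected port can check the server's health and request completions:

```python
import requests

base = "http://203.0.113.10:8080"  # placeholder documentation-range address

# The health endpoint confirms that a llama.cpp server is listening.
print(requests.get(f"{base}/health", timeout=10).json())

# Without an API key, arbitrary completions are open to anyone who can connect.
response = requests.post(
    f"{base}/completion",
    json={"prompt": "Hello", "n_predict": 16},
    timeout=30,
)
print(response.json().get("content"))
```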
The models hosted on the llama.cpp servers were mainly Llama 3-derived models, followed by Mistral models. Many of these were known jailbroken models, but most were models that are not widely known and were probably fine-tuned for specific purposes.