AI in the enterprise

– 16 min read

Vector databases, graph databases, and knowledge graphs

How to choose the right data storage for enterprise RAG

Deanna Dong

Nearly 54% of developers, according to 2023 Stack Overflow research, say that waiting for answers to their questions regularly interrupts and disrupts their workflow.

The complement to this finding is also true: Nearly 47% of developers often find themselves answering questions they’ve answered before. The information is very often there, in other words, but it’s either locked inside the heads of teammates or buried so deep in other company resources that developers need someone’s help to find it. 

If that sounds like a waste of time, imagine the same dynamic scaled across every department. No employee is spared: At some point, every person in every company has a question and has to wait for an answer. 

This problem isn’t an unknown one. A variety of solutions have arisen over the years, including company intranets, wikis, and supposed paradigm shifts like digital transformation. All have brought advances, but none have solved the seemingly simple yet endlessly complex problem of getting questions answered.

This history is why RAG, or retrieval-augmented generation, is often seen as the holy grail of knowledge management. 

By the time companies reach enterprise level, they’ve likely gone through numerous re-brands, restructurings, and pivots, with executives and board members coming and going and hundreds or thousands of employees onboarding and departing. In this context, answering even simple questions about the company can be challenging. 

AI – via RAG – promises to be the first real solution. With RAG, employees ask natural language questions, retrieval tools pull the relevant information from the underlying database, and generative AI formulates that information into a readable, relevant answer.

The initial results of RAG are powerful today and promising for what might come tomorrow, but those results also reveal significant limitations. Without a suitable database at the foundation, RAG can’t live up to its potential. 

Summarized by Writer

  • Vector databases store data in vector format, enabling efficient search but lacking context and relational information.
  • Graph databases prioritize relationships between data points, making them useful for modeling densely interconnected data but less efficient for large-scale processing and cross-database queries.
  • Knowledge graphs mimic human thinking by storing and connecting concepts, entities, relationships, and events using semantic descriptions, making them ideal for RAG-based search.
  • Knowledge graphs outperform vector and graph databases in search and retrieval tasks, especially for enterprise-scale RAG, due to their ability to preserve semantic relationships and encode structural information.
  • Choosing the right data storage technology for enterprise RAG requires careful evaluation of data processing, query retrieval, and LLM integration capabilities, with knowledge graphs emerging as the most suitable option.

What is a vector database?

A vector database stores and maintains data in vector format. When data is prepared for storage, it’s split into chunks, typically 100 to 200 characters long. Then an embedding model converts these chunks into vector embeddings that can be stored in the vector database.

A vector database has many use cases, but by itself, it isn’t a complete foundation for RAG. Most vector databases don’t provide an embedding model, so companies typically need to find and integrate one before they can use a vector database as their foundation.

Vector databases also vary significantly depending on which algorithms they use – either KNN (K-Nearest Neighbors) or ANN (Approximate Nearest Neighbor) – and each has different effects on the final result of any search and retrieval feature.

When users enter a query, the vector database converts the query into a vector embedding, and either a KNN or ANN algorithm determines which data points are closest to the query data point. As the name of each implies, both algorithms are similar, with ANN being more approximate but much faster. 
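As a rough sketch of this flow, the snippet below chunks text, embeds it, and runs an exact nearest-neighbor search by cosine similarity. The hashed bag-of-words embed() function is a toy stand-in for whichever embedding model you integrate, and real vector databases replace the brute-force loop with ANN indexes for speed; treat it as an illustration of the mechanics, not a production setup.

```python
import numpy as np

def chunk(text: str, size: int = 150) -> list[str]:
    """Split text into fixed-size character chunks (here, 150 characters)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashed bag-of-words vectors."""
    dim = 256
    vectors = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        for token in text.lower().split():
            vectors[row, hash(token) % dim] += 1.0
    return vectors

class ToyVectorStore:
    """Stores chunk embeddings and answers queries with exact (brute-force) KNN."""

    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.vectors: np.ndarray | None = None

    def add(self, texts: list[str]) -> None:
        vecs = embed(texts)
        self.chunks.extend(texts)
        self.vectors = vecs if self.vectors is None else np.vstack([self.vectors, vecs])

    def query(self, question: str, k: int = 3) -> list[str]:
        # Embed the query the same way, then rank stored chunks by cosine similarity.
        q = embed([question])[0]
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9
        )
        top = np.argsort(-sims)[:k]
        return [self.chunks[i] for i in top]

store = ToyVectorStore()
store.add(chunk("Apple introduced the Lisa in 1983 and the Macintosh in 1984."))
print(store.query("When did Apple introduce the first Macintosh?", k=1))
```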

Diagram of a vector database receiving information.

The strengths and weaknesses of vector databases

Vector databases have a range of strengths and weaknesses, but the weaknesses tend to come to the forefront when companies try to use them to build RAG features. These weaknesses also tend to be more consequential when companies face them at an enterprise scale.

Strengths

The primary strength of vector databases is that companies can store many different data types, including text and images, in a single numerical format.

Beyond sheer storage, vector databases also enable search functions that are much better than typical keyword searches. If users are looking for data that has semantic similarity, a vector database can often help them find those data points, even if there isn’t a literal keyword match. 

Weaknesses

The primary weakness of vector databases comes from how they process data. When data is chunked and embedded for vector storage, context is often lost, and the relational context between data points is especially liable to get obscured. In an enterprise setting, where information retrieval needs to be granular and accurate, this crude processing shows its limits.

The advantage vector databases have over keyword search – that they can identify data-point similarity based on nearness (via KNN or ANN) – becomes a weakness when compared to other databases. Beyond sheer numerical proximity, vector databases don’t preserve the context that informs the relationships between different data points.

As Anthony Alcaraz, Chief AI Product Officer at fair hiring platform Fribl, writes, “Questions often have an indirect relationship to the actual answers they seek.” For example, consider a simple description of the origins of Apple:

Apple was founded as Apple Computer Company on April 1, 1976. The company was incorporated by Steve Wozniak and Steve Jobs in 1977. Its second computer, the Apple II, became a best seller as one of the first mass-produced microcomputers. Apple introduced the Lisa in 1983 and the Macintosh in 1984 as some of the first computers to use a graphical user interface and mouse.

If this data is stored in a vector database, and a user queries, “When did Apple introduce the first Macintosh?”, correct data can turn into incorrect answers.

Given the crude chunking and the way KNN algorithms focus on mere proximity, the database might pull the closest chunks – ones containing “1983” and “Macintosh” – instead of what the paragraph actually says. Given this wrong data, the LLM forms a confident but incorrect answer: “The first Macintosh was introduced in 1983.”
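To see how easily this can happen, here’s an illustrative sketch of fixed-size character chunking applied to the paragraph above. Depending on where the boundaries fall, a chunk can pair “1983” with “the Macintosh” while “in 1984” ends up elsewhere; the exact split depends on the chunk size you choose.

```python
# Illustration only: fixed-size character chunking applied to the Apple paragraph.
paragraph = (
    "Apple was founded as Apple Computer Company on April 1, 1976. "
    "The company was incorporated by Steve Wozniak and Steve Jobs in 1977. "
    "Its second computer, the Apple II, became a best seller as one of the "
    "first mass-produced microcomputers. Apple introduced the Lisa in 1983 "
    "and the Macintosh in 1984 as some of the first computers to use a "
    "graphical user interface and mouse."
)

chunk_size = 150  # arbitrary choice; the failure mode depends on where boundaries fall
chunks = [paragraph[i:i + chunk_size] for i in range(0, len(paragraph), chunk_size)]
for n, c in enumerate(chunks):
    print(f"chunk {n}: {c!r}")  # inspect where each boundary lands relative to the dates
```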

This loss of context is, again, especially consequential in enterprise contexts because, in these environments, data tends to be very sparse or very dense. In either case, vector searches tend to struggle or even fail to find and return relevant or complete answers. This weakness worsens in high-dimensional environments, where KNN algorithms fail to find meaningful patterns – a problem known as the “curse of dimensionality.” 

Given the paucity of context, even an otherwise effective LLM will fail to formulate an accurate answer. Companies run into the classic garbage-in, garbage-out problem: with little to no context and crude chunking, the returned data points can be inaccurate or irrelevant to the query, setting LLMs up for failure. Generally, the more synthesis a question requires, the worse vector databases tend to perform.

Vector databases also tend to run into scalability, performance, and cost issues when used for RAG. Large datasets can make KNN algorithms inefficient, for example, and because KNN implementations typically hold the dataset in memory, processing those often large datasets can become resource-intensive.

This intensiveness can increase quickly, too, because companies need to reprocess and re-index the entire dataset whenever they add new data. And as performance and scalability issues arise, so do costs.

What is a graph database?

Like a vector database, a graph database stores and maintains data, but its data storage structure is unique. Whereas vector databases often lose relational context, graph databases give primacy to relationships by using nodes and edges between data points to form graphs.  

The unique nature of this relationship-first approach arose out of relational databases, an origin that makes graph databases worth considering for RAG. Relational databases store data in tables and organize similarly structured data closely together. Relational databases, however, don’t store relationships as first-class objects; relationships have to be inferred through foreign keys and joins.

As Kaleb Nyquist, the Database Manager at the Center for Tech & Civic Life, writes, “Ironically, graph databases are actually more relationship-oriented than relational databases.”

Graph databases tend to be best for modeling densely interconnected data. Because graph databases model data similarly to object-oriented languages, the resulting database contains direct pointers to related data points. 

Diagram of a graph database showing how semantic relationships are mapped and encoded. Two yellow circles with human silhouettes are labeled "Danielle" and "Ann." A yellow circle with a car silhouette is labeled "Subaru." Arrows labeled "loves," "lives with," "drives," and "owns" indicate the relationships between the entities.
How semantic relationships are mapped and encoded in a graph database

The strengths and weaknesses of graph databases

Graph databases were a hot topic in the database world only a few years ago, and in many cases, the emphasis on relationships makes them a useful tool. But in the years since, users of graph databases have also found weaknesses that make graph databases – without supplementation – a less-than-ideal choice for RAG purposes. 

Strengths

Graph databases, because they give primacy to relationships, tend to be most advantageous in contexts where the relationships between data points are essential for making meaning. Unlike relational databases, graph databases provide a native way of storing the relationships themselves, allowing developers to store relationships as memory pointers that lead from one entity to the next. 

With the ability to define relationships directly, developers don’t need to worry about modeling those relationships in the database schema. Philipp Brunenberg, a data engineer, explains the benefits: “We do not need to know about foreign keys, and neither do we have to write logic about how to store them. We define a schema of entities and relationships, and the system will take care of it.”

Each relationship provides context between data points, but developers can also label the nodes and edges that form these relationships, assigning weights and directionality to each one. As Brunenberg writes, “Direction provides meaning to the relationship; they could be no direction, one-way, or two-way.”
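As a minimal sketch of these ideas, the snippet below models the Danielle, Ann, and Subaru diagram with plain Python objects instead of a real graph database engine: each edge is a labeled, directed, optionally weighted pointer straight to its target node. Which person loves, drives, or owns what is an illustrative assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                      # e.g. "Person" or "Car"
    name: str
    edges: list["Edge"] = field(default_factory=list)

@dataclass
class Edge:
    relation: str                   # e.g. "drives", "owns", "lives with"
    target: Node                    # a direct pointer to the related node
    weight: float = 1.0             # optional weight on the relationship

danielle = Node("Person", "Danielle")
ann = Node("Person", "Ann")
subaru = Node("Car", "Subaru")

danielle.edges.append(Edge("loves", ann))
danielle.edges.append(Edge("lives with", ann))
danielle.edges.append(Edge("drives", subaru))
ann.edges.append(Edge("owns", subaru))

# Traversal follows pointers directly rather than joining tables on foreign keys.
for edge in danielle.edges:
    print(f"Danielle --{edge.relation}--> {edge.target.name}")
```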

Graph databases also tend to be more easily understandable for non-technical users because the resulting models can be visualized in a way that mirrors how people naturally picture relationships. This visual quality often makes them a top choice in contexts where knowledge needs to be retrieved, modeled, and displayed.

Weaknesses

Graph databases have a few weaknesses, especially regarding efficiency and efficacy. 

Graph databases often run into efficiency problems when companies use them to process large volumes of data. In an enterprise context, where there can be a lot of sparse and dense data, graph database efficiency is especially likely to plummet. 

Graph databases are also less effective when used to run queries that extend across databases. The larger these databases are, the less effective these queries become. 

Developers are often drawn to graph databases because they’re known to be especially good at modeling relationships, but this advantage has limits. Graph databases can model relationships well, but they can’t create those relationships on their own. If the underlying data is poorly captured, the search and retrieval benefits won’t be fully realized.

What is a knowledge graph?

Unlike vector databases and graph databases, a knowledge graph is a data storage technique rather than a fundamentally different kind of database. Knowledge graphs model how humans think – relationally and semantically – and go far beyond the numerical focus of vector databases and the relational focus of graph databases.

The knowledge graph technique collects and connects concepts, entities, relationships, and events using semantic descriptions of each. Each description contributes to an overall network (or graph), meaning every entity connects to the next via semantic metadata. 

This technique for storing and mapping data most closely mimics how humans think in semantic contexts. This parallel makes it an ideal foundation for RAG-based search because RAG relies on natural language queries flowing through databases made up primarily of semantic information.

The Writer Knowledge Graph, in particular, integrates RAG, offering users:

  • Data connectors that link disparate data sources.
  • A specialized LLM that processes data at enterprise scales to build meaningful semantic relationships amongst dense and sparse data points.
  • A graph structure for data storage that can store both data points and data relationships while updating dynamically whenever new data is added.
  • Retrieval-aware compression that can condense data and index metadata without accuracy loss.

Diagram of a Knowledge Graph, showing the way information, relationships, events, and concepts are collected. Yellow circles, representing nodes in a knowledge graph, contain concepts like living things, animals, plants, cows, dogs, and herbs. Edges, represented as purple capsule shapes, contain connecting concepts such as "is" and "eat."
The knowledge graph technique collects and connects concepts, entities, relationships and events using semantic descriptions of each.
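As a bare-bones sketch of the technique, the snippet below stores the diagram’s entities as subject-predicate-object triples and looks up everything connected to a given entity. A production knowledge graph attaches far richer semantic metadata to each node and edge; this only shows the skeleton.

```python
# Subject-predicate-object triples mirroring the nodes and edges in the diagram above.
triples = [
    ("animals", "is", "living things"),
    ("plants", "is", "living things"),
    ("cows", "is", "animals"),
    ("dogs", "is", "animals"),
    ("herbs", "is", "plants"),
    ("cows", "eat", "herbs"),
]

def related(entity: str) -> list[tuple[str, str]]:
    """Return every (predicate, neighbor) pair touching an entity, in either direction."""
    connections = []
    for subject, predicate, obj in triples:
        if subject == entity:
            connections.append((predicate, obj))
        elif obj == entity:
            connections.append((f"inverse of '{predicate}'", subject))
    return connections

print(related("cows"))  # [('is', 'animals'), ('eat', 'herbs')]
```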

The strengths and weaknesses of knowledge graphs

Knowledge graphs, built on graph databases, have many of the same advantages that graph databases have over vector databases. Knowledge graphs, however, also present advantages over graph databases in particular contexts – especially RAG.

Strengths

Knowledge graphs, like graph databases, store both data points and the relationships between them. Like graph and vector databases, knowledge graphs can store a wide variety of file formats, including video, audio, and text.

But in vector databases, queries are converted into a numerical format, often meaning context is lost. In knowledge graphs, queries don’t need to be reformatted, and the graph structure they run against – because it preserves semantic relationships – allows for much more accurate retrieval than KNN or ANN algorithms can offer.

Search and retrieval – central to RAG – is especially effective in knowledge graphs. In enterprise contexts, the differentiator between effective and ineffective search is often the ability to synthesize data across multiple sources. Knowledge graphs encode topical, semantic, temporal, and entity relationships into their graph structure, making synthesis possible. 

Relationships, however, aren’t always linear or one-way. With knowledge graphs, developers can encode hierarchies and other structural relationships. Given these structural relationships, knowledge graphs can map the connections between data points spread across different sources that reference the same entities.

“In contrast,” Alcaraz writes, “standard vector search lacks any notion of these structural relationships. Passages are treated atomically without any surrounding context.”
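As a rough sketch of what this kind of multi-hop, structure-aware retrieval looks like, the snippet below walks a small fact graph whose entities, relationships, and source names are all hypothetical. A question like “Who leads the team that owns Project Atlas?” isn’t answered by any single passage, but it falls out of a two-hop traversal that also preserves the provenance of each hop.

```python
from collections import deque

# (subject, predicate, object, source document) -- all names here are hypothetical
facts = [
    ("Project Atlas", "owned by", "Platform Team", "wiki/atlas-overview"),
    ("Platform Team", "led by", "Priya", "hr/org-chart"),
    ("Project Atlas", "depends on", "Billing API", "eng/design-doc-42"),
]

def paths_from(start: str, max_hops: int = 2):
    """Breadth-first walk over the fact graph, keeping the provenance of every hop."""
    queue = deque([(start, [])])
    while queue:
        entity, path = queue.popleft()
        if len(path) >= max_hops:
            continue
        for subj, pred, obj, src in facts:
            if subj == entity:
                new_path = path + [(subj, pred, obj, src)]
                yield new_path
                queue.append((obj, new_path))

# "Who leads the team that owns Project Atlas?" needs two hops across two sources.
for path in paths_from("Project Atlas"):
    if path[-1][1] == "led by":
        print(" -> ".join(f"{s} {p} {o} [{src}]" for s, p, o, src in path))
```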

Context loss is one of the most common weaknesses of other databases, especially when used for RAG. With knowledge graphs, contextual information is retained because it is encoded in the retrieved information. 

Specialized LLMs, as used in the Writer Knowledge Graph, make this advantage even greater. By building richer semantic relationships, companies can preserve more context and maintain an intelligent definition of relationships across multiple dimensions. 

Weaknesses

Due to their emphasis on semantic information, knowledge graphs tend to have a lot of data to condense, which often demands significant computational power. Operations that run across an entire knowledge graph can also be expensive, and that cost can make them difficult to scale.

And, similar to the weaknesses inherent to graph databases, knowledge graphs can’t do the work of capturing and cleaning data for you. Likewise, even an effective knowledge graph will be hampered by an LLM that can’t formulate readable answers without hallucinating.

How to choose the right data storage technology to support RAG for enterprise

RAG is a particular enough use case that a general-purpose data storage technology, however well-reputed, likely won’t suit it. If you’re exposing RAG to users, the efficiency of the search and the accuracy of the retrieved information must be extremely high for those users to trust it.

This article has covered a lot of different technologies and their nuances, so first, take a step back. Knowledge retrieval across any database involves three basic tasks. One way to compare these databases, then, is to evaluate how well each technology supports each task.

  • First, the database has to process the data. How will the database split volumes of data into smaller chunks to be stored in a data structure?
  • Second, the database has to offer the ability to query for information and retrieve it. How will the database use a query to retrieve relevant pieces of data that can answer that query?
  • Third, the database has to formulate data to allow an LLM to generate an answer. How well will the database send that relevant data to an LLM for answer generation?

Knowledge graphs stand out across all three questions. They’re especially useful for any company using RAG, and the differences only widen when companies use RAG at an enterprise scale.

In a Knowledge Graph and LLM Accuracy Benchmark Report, researchers found that answer accuracy increased from 16% when GPT-4 queried SQL databases directly to 54% when it queried a knowledge graph representation of the same SQL database.

This accuracy difference isn’t just important for RAG; it’s often make or break. When users have questions, they, by definition, lack answers, so when a poorly implemented RAG returns inaccurate answers or hallucinations, the confusion can be consequential. With the stakes this high, it doesn’t take many mistakes for users to consider an otherwise accurate RAG untrustworthy. 

RAG success depends on your foundational choices

In many software development contexts, it’s best to ship fast and iterate.

Startups, for example, tend to benefit from building products with the most accessible technologies, shipping them to potential users, and figuring out which direction they’ll build in based on the results. Ship the product; get feedback on the product and the technologies supporting it; iterate; and ship again. 

Enterprises, however, particularly those building features that rely on high trust amongst users, can’t afford to rush in and iterate later. RAG depends so much on the earliest, most fundamental technical decisions that intensive research has to be done upfront. Iteration and feedback will still be necessary, but if enterprises build the wrong foundation, the iteration will eventually involve migration or rebuilding instead of incremental improvement.

The research challenge matches the complexity of the task. An enterprise contains a lot of information, but it’s not laid out like a book; an enterprise more closely resembles layers of sedimentary rock, many materials built up over many years. The complexity is emergent, and no amount of pre-work can make an enterprise’s information simple to parse, map, and retrieve. Instead, if companies want to build enterprise-ready RAG, they need to face that complexity with a tool built – from the ground up – to handle it.

Knowledge Graph, the graph-based retrieval-augmented generation (RAG) solution from Writer, achieves higher accuracy than traditional RAG approaches that use vector retrieval. Learn more and request a demo.