Don’t RAG on Knowledge Graphs(Or Do): RAG, Knowledge Graphs, and Benchmarks – Part Zero

Foundational preamble to the benchmarking of knowledge graph-centric RAG flows
knowledge-graphs
rag
benchmarking
Author

Dmitriy Leybel

Published

April 5, 2024

Abstract
Out of the many retrieval algorithms, building knowledge graphs in conjunction with vector stores is a promising path forward as the traceability and veracity of LLM output become more and more critical to successful adoption in this newfound frontier of AI. It is my goal to convince you that the combination of symbolic representational knowledge and semantic embeddings is a powerful avenue to explore in this space.

It’s finally happening. It, being me writing a blog entry. (I’m editing this at 10k words. I guess it’s more of an article and less of a blog entry)


The motivation behind this series of posts is twofold: first, to run a basic knowledge graph Retrieval Augmented Generation(RAG) benchmark I can build off of and iteratively improve, and secondly, to give the reader a ride-along of the process: choosing a benchmark, creating a knowledge graph, connecting the knowledge graph to a vector store, and so forth. I am going to not only break down the components of a RAG system, but also introduce the necessary parts of any LLM workflow - so there will certainly be something for everyone. You are free to use the table of contents to skip around to what interests you most, or embark on an end-to-end marathon read.

I fully believe in democratizing the ability to build and test your own LLM tools, as they are a critical frontier of artificial intelligence. That is the path towards progress and away from the centralization of these fantastic technologies.

1 Background

1.1 RAG

Large Language Models(LLMs) are fantastic…that is, until you attempt to verify their output.

For this reason, RAG has been a fundamental component of truthiness. It also allows you to augment the LLM output through context-stuffing. The number of tokens you can stuff into your context is not limitless, so you can’t merely stuff all of your documents and tables into it. Out of this limit emerge dozens of RAG techniques which try to hydrate the prompt as effectively as possible. The fine folks at Langchain have illustrated a small portion of these techniques here(Fig 1). Even with the promise of a 10 million token context window, there is no abatement in upcoming RAG techniques and the companies built around them.

Figure 1: Soiree of RAG techniques
Source: Langchain blog

An ever-growing survey of these techniques exists - and even that is not fully exhaustive. P.S. exa.ai is a fantastic source for research.

For reference, here(Fig 2) is a diagram of one of the simplest versions of RAG being implemented.

Figure 2: Basic RAG Example

1.2 Knowledge Graphs

It’s much easier to illustrate than explicate what a knowledge graph is(Fig 3). Below, we have a knowledge graph that represents places, people, and things along with their relationships to one another. This is a directed graph, in the sense that the connections flow in one direction - this generally makes it easier to specify the relationships between entities. There are many names for the entities within a knowledge graph as well as for the connections between them; one of the most common conventions is nodes for the entities, such as “Bob” or “The Louvre”, and edges for the connections between the nodes, such as “was created by” or “is located in”. Additionally, nodes and edges can both have properties or attributes - for instance, the ‘Museum’ node can be enriched with attributes such as “capacity: 2,000”, and the edge ‘visited’ can be assigned a date attribute “date: March 28th, 2005”. You’ll often hear the word triple in reference to two nodes connected by an edge(Node A, Edge, Node B).

Figure 3: An example of a basic knowledge graph.
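To make the node/edge/triple vocabulary concrete, here is a minimal sketch of a slice of Fig 3 using the networkx library(the specific labels and attributes are illustrative, not part of any particular schema):

import networkx as nx

# A directed multigraph: edges point one way, and two nodes can share multiple relationships
kg = nx.MultiDiGraph()

# Nodes(entities) with enriching attributes
kg.add_node("Bob", label="Person")
kg.add_node("The Louvre", label="Museum", capacity=2000)

# An edge(relationship) with its own attribute; together these form the triple
# (Bob) -[visited {date: ...}]-> (The Louvre)
kg.add_edge("Bob", "The Louvre", key="visited", date="March 28th, 2005")

print(list(kg.edges(keys=True, data=True)))
# [('Bob', 'The Louvre', 'visited', {'date': 'March 28th, 2005'})]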

Knowledge graphs are often created within graph databases such as Neo4j, memgraph, or Amazon Neptune. They are often used within enterprises to integrate data from structured and unstructured databases alike to enable a single source of truth or knowledge. In theory, they are fantastic tools for information storage and retrieval, however, in practice they have a lot of quirks that prevent many companies from using them. The distillation of a company’s data into a neat set of nodes and edges is a complex task that requires knowledge graph experts, as well as alignment from all corners of the organization.

While the appeal of knowledge graphs is immense, since they match our intuitive sense of informational organization and structure, you can see for yourself how difficult the task is by trying to organize the things on your desk into a knowledge graph. Your brain has no problem making sense of it all and maintaining its own knowledge representation of what’s in front of your nose, but reproducing it in a knowledge graph is not as straightforward as our intuition leads us to believe.

“Are you done sh*tting on knowledge graphs, Dmitriy?”

1.2.1 LLM Synergy

Yes. In fact, here I am proudly generating a knowledge graph for the world to see.

It may seem like this is the start of an all-hands meeting that’s 45 minutes too long, but I promise you that it’s not(unless you want it to be?). The word ‘synergy’ is perfect for describing the relationship between LLMs and knowledge graphs. The lowest hanging fruit for this match made in heaven was writing queries. Given a schema, an LLM can query a graph database to retrieve information.

Some graph databases can be queried with Cypher(a graph querying language):

MATCH (n:Person)-[r:KNOWS]->(m:Person)
WHERE n.name = 'Alice'
RETURN n, r, m

If you’re familiar with SQL, you’ll immediately see the similarities. This query returns the person node n with the name Alice along with all of the people(m) she knows(r). Fortunately, LLMs are superb at query languages, so your Cypher prowess can be minimal to nonexistent and you can still compose this masterpiece:

yo chatgpt, this is my graph db’s schema:schema here I need you to write a Cypher query that returns all of the people Alice knows

Cool. Now we can fire all of these data analysts, right? Maybe next year. (DISCLAIMER: this is a joke, not business advice)
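If you’d rather run that query from Python than from the Neo4j browser, a minimal sketch with the official neo4j driver looks something like this(the connection details are placeholders for a local instance):

from neo4j import GraphDatabase

# Placeholder credentials for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (n:Person)-[r:KNOWS]->(m:Person)
WHERE n.name = 'Alice'
RETURN n, r, m
"""

with driver.session() as session:
    for record in session.run(query):
        # Each record carries the matched nodes and relationship
        print(record["n"]["name"], "KNOWS", record["m"]["name"])

driver.close()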

Query generation turns out to be fairly popular, with frameworks like Langchain and LlamaIndex creating modules to do just that. It turns out that with LLMs, we can not only build queries, but also build the knowledge graph itself. I will later go over this at length, so to be brief: you can have an LLM go over a set of documents chunk by chunk and iteratively output these triples of nodes and edges(a sketch of this loop follows below). After loading them into a graph database, you can end the process there and trot along with your newly minted database, or you can now let the LLM create queries against that database as described earlier.
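As a rough sketch of that chunk-by-chunk extraction loop, suppose we have a call_llm function standing in for whichever model and client you prefer, and we ask it to emit triples as JSON(the prompt and the output format here are assumptions, not a prescribed recipe):

import json
import networkx as nx

def call_llm(prompt: str) -> str:
    """Stand-in for your LLM client of choice; returns the raw completion text."""
    raise NotImplementedError

EXTRACTION_PROMPT = """Extract knowledge graph triples from the text below.
Return only a JSON list of [subject, relation, object] triples.

Text:
{chunk}"""

def build_graph(chunks: list[str]) -> nx.MultiDiGraph:
    kg = nx.MultiDiGraph()
    for i, chunk in enumerate(chunks):
        raw = call_llm(EXTRACTION_PROMPT.format(chunk=chunk))
        for subject, relation, obj in json.loads(raw):
            # Remember which chunk each triple came from so it can be tied to an embedding later
            kg.add_edge(subject, obj, key=relation, source_chunk=i)
    return kg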

Langchain and LlamaIndex also have their own plug-n-play knowledge graph creation modules.

At this point, like any rational human being, you may be asking, can this get any better? I mean, you’ve lasted this long, so I imagine that you already know the answer.

1.3 RAG + Knowledge Graphs

Important

Remember, there is more than one way you can skin a cat. The examples provided are merely the ones I believe are most illustrative of the main components. The extent of the composability is only limited by your imagination.

When you combine RAG with knowledge graphs, you get the best of both worlds. On one hand, you get a fuzzy(probabilistic) semantic layer which can be used to compare the essence of sentences or paragraphs via embeddings. On the other, you have a discrete and symbolic representation of knowledge. That sounds an awful lot like humans – vibes-based logical processors.

There are limitless ways to construct a system that exploits both modalities, so I’m going to focus on the base cases. The fundamental relationship takes place between the vector embeddings and the knowledge graph. The nodes(and in some cases, the edges) are linked to an embedding related to their source material.

The first objective is to use an LLM to create the knowledge graph in conjunction with the embeddings. The embeddings will be stored in a vector database or vector store, which is essentially an optimized container that allows extremely fast vector comparison operations so you can quickly find the most similar embeddings. Some vector databases live in the cloud(Pinecone), some can be self-hosted(Chroma), and some can stay in your very program’s memory(FAISS). Fig 4 illustrates the fundamentals of generating your knowledge graph and vector store.

1.3.1 Generating Knowledge Graphs and Populating Vector Stores

Once a corpus of documents is chunked into pieces, those pieces can be processed by the LLM and converted into triples which are then loaded into the knowledge graph. Concurrently, embeddings are created for the document chunks and loaded into the vector store. In each case, you can attach the reference for the node or embedding in its respective twin – this is where the magic lies. The text from the chunked documents can be stored in either the knowledge graph or in the vector store, or both. Once both are established, there are multiple retrieval strategies we can use to take advantage of this system.

Note

Building the knowledge graph sounds simpler than it is, and just like the architectural design of these systems, it is open to a myriad of potential approaches – some good, and some not so good. This will be addressed later.

Figure 4: Knowledge Graph and Embedding Generation
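As a minimal sketch of the flow in Fig 4, assume the build_graph function from earlier, a hypothetical embed function standing in for whatever embedding model you use, and FAISS as the in-memory vector store; the row position of each chunk in the FAISS index doubles as the cross-reference ID stored on the nodes:

import faiss
import numpy as np

DIM = 384  # embedding dimensionality; depends on the model you assume

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for your embedding model; returns a (len(texts), DIM) float32 array."""
    raise NotImplementedError

def index_chunks(chunks, kg):
    # Embed every chunk and load the vectors into an exact nearest-neighbor index
    index = faiss.IndexFlatL2(DIM)
    index.add(embed(chunks).astype("float32"))  # row i corresponds to chunks[i]

    # Cross-reference: each node remembers the chunk rows(= embedding IDs) it was extracted from
    for node in kg.nodes:
        kg.nodes[node]["chunk_ids"] = set()
    for subject, obj, relation, data in kg.edges(keys=True, data=True):
        chunk_id = data["source_chunk"]
        kg.nodes[subject]["chunk_ids"].add(chunk_id)
        kg.nodes[obj]["chunk_ids"].add(chunk_id)
    return index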

1.3.2 Retrieval Strategy #1 Focused on Embeddings Search Followed by Knowledge Graph Adjacency

With a populated vector store and knowledge graph, we are set to experiment with a wide array of retrieval strategies in pursuit of finding the best one to hydrate our prompt. Fig 5 involves using the vector store to find the nearest matching embedding, find its reference in the knowledge graph, and then find the adjacent nodes within the knowledge graph to add to our prompt. This makes intuitive sense because concepts related to the initial node are likely to be relevant for the LLM in addressing the user’s query.

Figure 5: One strategy of retrieval through first finding a close embedding, and then utilizing the adjacency of nodes in the knowledge graph to hydrate the prompt
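Here is a sketch of Fig 5 reusing the embed function, FAISS index, networkx graph, and chunk list from the previous sketches(all of which are assumptions carried over, not a fixed interface):

def retrieve_via_embeddings(question, index, kg, chunks):
    # 1. Find the chunk whose embedding is closest to the question
    _, ids = index.search(embed([question]).astype("float32"), 1)
    best_chunk = int(ids[0][0])

    # 2. Follow the reference into the knowledge graph and gather adjacent nodes
    neighborhood = set()
    for node, data in kg.nodes(data=True):
        if best_chunk in data.get("chunk_ids", set()):
            neighborhood.add(node)
            neighborhood.update(kg.successors(node))
            neighborhood.update(kg.predecessors(node))

    # 3. Hydrate the prompt with the matched text and its graph neighborhood
    return (f"Context:\n{chunks[best_chunk]}\n"
            f"Related entities: {', '.join(sorted(neighborhood))}\n"
            f"Question: {question}")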

1.3.3 Retrieval Strategy #2 Focused on Graph Query Generation

Another retrieval strategy would switch the knowledge graph and vector store steps around. This involves an extra call to the LLM in order to construct the query we’ll send to the knowledge graph. Once the nodes(and edges) are returned, we can trace each node to its referenced embedding and retrieve the neighborhood of embeddings along with their text. Alternatively, we can ignore the embeddings and simply focus on the neighborhood of the returned nodes. For the example in Fig 6, I’ll focus on the former. As much as we both love flowcharts, I have a feeling you’re getting somewhat tired of them. That said, here’s one more.

Figure 6: Another strategy for retrieval is to generate queries against the graph database containing the knowledge graph, and then trace the returned nodes to their referenced embeddings to hydrate the prompt.
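And a sketch of Fig 6, again with call_llm as a stand-in; the assumption here is that the graph database stores a chunk_ids property on its nodes pointing back at the vector store rows, which is a schema choice rather than anything prescribed:

CYPHER_PROMPT = """This is my graph database's schema:
{schema}

Write a single Cypher query(and nothing else) that returns the nodes needed to answer:
{question}"""

def retrieve_via_graph_query(question, schema, driver, chunks):
    # 1. Extra LLM call: turn the user's question into a Cypher query
    cypher = call_llm(CYPHER_PROMPT.format(schema=schema, question=question))

    # 2. Run it and collect the embedding IDs referenced by the returned nodes
    chunk_ids = set()
    with driver.session() as session:
        for record in session.run(cypher):
            for value in record.values():
                if hasattr(value, "get"):  # skip scalars; keep nodes/relationships
                    chunk_ids.update(value.get("chunk_ids", []))

    # 3. Pull the referenced texts back out to hydrate the prompt
    context = "\n".join(chunks[cid] for cid in sorted(chunk_ids))
    return f"Context:\n{context}\nQuestion: {question}"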

2 Finding a benchmark

In order to benchmark the performance of this RAG + Knowledge Graph flow, we need to find a dataset or datasets that are commonly used for benchmarking RAG pipelines, as well as the metrics that go with them. We can go back to the survey mentioned in the section on RAG and look at its corresponding arxiv.org paper. Within it, there is a table of tasks(Fig 7) and a table of metrics(Fig 8).

Figure 7: RAG Datasets
Figure 8: RAG Metrics

This is a perfect starting point because now we have a smorgasbord of references to peruse to gain an understanding of how to best proceed with benchmarking. The first option that comes to mind is the GraphQA subtask; however, looking into the mentioned paper on arxiv, G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering, it is evident that it is concerned with creating a graph dataset for the purpose of benchmarking the ability of an LLM to chat with graphs. While that is adjacently relevant, our current goal is to use knowledge graphs as tools in retrieval, not as the main subject of a question-answering(QA) task.

2.1 Hotpot and Beer?

Single-hop benchmarking appears to be the most popular according to the RAG survey; however, we have more faith in knowledge graphs than a measly single-hop reasoning task. A single-hop task requires the information from a single document to answer a question, whereas a multi-hop task requires you to hop between documents in order to answer it. HotPotQA appears to be the most popular multi-hop dataset. Mentioned immediately on the HotPotQA website is another dataset which they shout out as newer, requiring a more diverse set of hops, and including the HotPotQA dataset within it - BeerQA(is anyone else thirsty…). It combines QA data from three sources: HotPotQA, SQuAD, and its own questions formulated from Wikipedia for even more hops. Upon further inspection, BeerQA specifies that it primarily focuses on a fullwiki evaluation, that is to say, you must use the entirety of Wikipedia in the task. Due to time and resource constraints, we do not currently want to build a knowledge graph from a 24GB dataset from the get-go. We do, however, want to be able to iterate in a quick and agile manner. HotPotQA doesn’t have the same compute-heavy requirement, and neither does another amusingly named dataset.

2.2 MuSiQue to my ears

According to the HotpotQA paper, it also offers a fullwiki evaluation setting, but it has a distractor setting as well, where you’re given 2 ‘gold’ paragraphs containing the connecting information coupled with 8 irrelevant ‘distractors’ that serve as noise. Another dataset was created as an improvement over HotpotQA as well as its successor, 2WikiMultihopQA – MuSiQue(Multihop Questions via Single-hop QUestion Composition) includes questions with upwards of 18 distractors and numerous gold paragraphs in order to create questions of up to 4 hops. It also handles some cases that would’ve allowed cheating within HotPotQA(such as shortcutting the hops by inferring information that should require them). In addition, MuSiQue adds answerability to the mix – roughly half of the questions are unanswerable given the data, with the breadcrumbs provided by the distractors being misleading.

This is a great augmentation because this is the type of eval that mirrors the real world, where we often expect information retrieval to come up short.

Figure 9: Answerable and Nonanswerable Multihop Questions

MuSiQue contains two evaluations: one with only answerable questions, and the other evenly divided between answerable and unanswerable questions. If we look at the MuSiQue leaderboards in Fig 10, we see that the F1 score(harmonic mean of precision and recall – the higher the better) is substantially better for the Answerable dataset, as it removes the option of there being unanswerable questions for the models to hallucinate on.

Figure 10: MuSiQue Leaderboard Comparison
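Since the leaderboard ranks on answer F1, here is the standard token-overlap F1 used by SQuAD-style benchmarks, sketched without the usual answer normalization(lowercasing, stripping punctuation and articles) that the official evaluation scripts apply:

from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Count of tokens shared between the predicted and gold answers
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("Walt Disney Company", "Walt Disney"))  # 0.8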

Before wrapping up, I’d like to at least share some of the dev dataset, published in the MuSiQue github repo, that is meant to be used in the development of your data pipeline.

Code
import jsonlines

# Read the first five entries of the MuSiQue dev split into a list of dicts
js_list = []
with jsonlines.open('data/musique_full_v1.0_dev.jsonl') as reader:
    for i in range(5):
        js_list.append(reader.read())
js_list[0]
{'id': '2hop__153573_109006',
 'paragraphs': [{'idx': 0,
   'title': 'History of the Internet',
   'paragraph_text': "Precursors to the web browser emerged in the form of hyperlinked applications during the mid and late 1980s (the bare concept of hyperlinking had by then existed for some decades). Following these, Tim Berners - Lee is credited with inventing the World Wide Web in 1989 and developing in 1990 both the first web server, and the first web browser, called WorldWideWeb (no spaces) and later renamed Nexus. Many others were soon developed, with Marc Andreessen's 1993 Mosaic (later Netscape), being particularly easy to use and install, and often credited with sparking the internet boom of the 1990s. Today, the major web browsers are Firefox, Internet Explorer, Google Chrome, Opera and Safari.",
   'is_supporting': False},
  {'idx': 1,
   'title': 'Ceville',
   'paragraph_text': "Ceville is a humorous graphic adventure video game developed by the German game studio Realmforge Studios and published by Kalypso Media. Despite the game's use of 3D environments and models, the gameplay is very true to the graphical point-and-click adventure tradition of gameplay, immortalized by game series like Monkey Island from LucasArts and the King's Quest series from Sierra Online.",
   'is_supporting': False},
  {'idx': 2,
   'title': 'Zipline Safari',
   'paragraph_text': "Zipline Safari is a zip-line course in Florida. It is the only zip-line course in the state, and is claimed to be the world's only zip-line created for flat land. Zipline Safari opened on 16 January 2009 in Forever Florida, a wildlife attraction near Holopaw, Florida. The zip-line cost $350,000 to build, and consists of nine platforms built up from the ground and traveled between by zip-lining. Forever Florida built the course to promote ecotourism and interaction with the natural environment of Florida.",
   'is_supporting': False},
  {'idx': 3,
   'title': 'Parc Safari',
   'paragraph_text': "Parc Safari is a zoo in Hemmingford, Quebec, Canada, and is one of the region's major tourist attractions; that has both African & Asian species of elephant.",
   'is_supporting': False},
  {'idx': 4,
   'title': 'The Reporter (TV series)',
   'paragraph_text': 'The Reporter is an American drama series that aired on CBS from September 25 to December 18, 1964. The series was created by Jerome Weidman and developed by executive producers Keefe Brasselle and John Simon.',
   'is_supporting': False},
  {'idx': 5,
   'title': 'Earthworm Jim 4',
   'paragraph_text': 'Earthworm Jim 4 is a video game in the "Earthworm Jim" series. It was originally announced by Interplay Entertainment in 2008, and referred to by Interplay as "still in development" in May 2011. Later commentary from individual developers would claim that development hadn\'t started, though desire to create a new entry in the series remained. In May 2019, it was announced that the game was to begin development exclusively for the upcoming Intellivision Amico console.',
   'is_supporting': False},
  {'idx': 6,
   'title': 'Adobe Flash Player',
   'paragraph_text': 'Availability on desktop operating systems Platform Latest version Browser support Windows XP and later Windows Server 2003 and later 27.0. 0.183 Firefox, Chrome, Chromium, Safari, Opera, Internet Explorer, Microsoft Edge Windows 2000 11.1. 102.55? Windows 98 and ME 9.0. 115.0? Windows 95 and NT 4 7.0. 14.0? Mac OS X 10.6 or later 27.0. 0.183 Firefox, Chrome, Chromium, Safari, Opera Mac OS X 10.5 10.3. 183.90? Classic Mac OS, PowerPC 7.0. 14.0? Classic Mac OS, 68k 5.0? Linux 27.0. 0.183 Firefox, Chrome, Chromium, Opera',
   'is_supporting': False},
  {'idx': 7,
   'title': 'Apple Inc.',
   'paragraph_text': "Apple Inc. is an American multinational technology company headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software, and online services. The company's hardware products include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, the Apple TV digital media player, and the HomePod smart speaker. Apple's consumer software includes the macOS and iOS operating systems, the iTunes media player, the Safari web browser, and the iLife and iWork creativity and productivity suites. Its online services include the iTunes Store, the iOS App Store and Mac App Store, Apple Music, and iCloud.",
   'is_supporting': False},
  {'idx': 8,
   'title': 'Philadelphia Zoo',
   'paragraph_text': 'The Philadelphia Zoo, located in the Centennial District of Philadelphia, Pennsylvania, on the west bank of the Schuylkill River, was the first true zoo in the United States. Chartered by the Commonwealth of Pennsylvania on March 21, 1859, its opening was delayed by the American Civil War until July 1, 1874. It opened with 1,000 animals and an admission price of 25 cents. For a brief time, the zoo also housed animals brought over from safari on behalf of the Smithsonian Institution, which had not yet built the National Zoo.',
   'is_supporting': False},
  {'idx': 9,
   'title': 'Web browser',
   'paragraph_text': "Apple's Safari had its first beta release in January 2003; as of April 2011, it had a dominant share of Apple-based web browsing, accounting for just over 7% of the entire browser market.",
   'is_supporting': False},
  {'idx': 10,
   'title': 'List of The 100 episodes',
   'paragraph_text': 'The 100 (pronounced The Hundred) is an American post-apocalyptic science fiction drama television series developed by Jason Rothenberg, which premiered on March 19, 2014, on The CW. It is loosely based on a 2013 book of the same name, the first in a book series by Kass Morgan. The series follows a group of teens as they become the first people from a space habitat to return to Earth after a devastating nuclear apocalypse.',
   'is_supporting': False},
  {'idx': 11,
   'title': 'Shiira',
   'paragraph_text': 'Shiira (シイラ, Japanese for the common dolphin-fish) is a discontinued open source web browser for the Mac OS X operating system. According to its lead developer Makoto Kinoshita, the goal of Shiira was "to create a browser that is better and more useful than Safari". Shiira used WebKit for rendering and scripting. The project reached version 2.3 before it was discontinued, and by December 2011 the developer\'s website had been removed.',
   'is_supporting': False},
  {'idx': 12,
   'title': 'Traffic Department 2192',
   'paragraph_text': 'Traffic Department 2192 is a top down shooter game for IBM PC, developed by P-Squared Productions and released in 1994 by Safari Software and distributed by Epic MegaGames. The full game contains three episodes (Alpha, Beta, Gamma), each with twenty missions, in which the player pilots a "hoverskid" about a war-torn city to complete certain mission objectives. The game was released as freeware under the Creative Commons License CC BY-ND 3.0 in 2007.',
   'is_supporting': False},
  {'idx': 13,
   'title': 'Maciej Stachowiak',
   'paragraph_text': "Maciej Stachowiak (; born June 6, 1976) is a Polish American software developer currently employed by Apple Inc., where he is a leader of the development team responsible for the Safari web browser and WebKit Framework. A longtime proponent of open source software, Stachowiak was involved with the SCWM, GNOME and Nautilus projects for Linux before joining Apple. He is actively involved the development of web standards, and is a co-chair of the World Wide Web Consortium's HTML 5 working group and a member of the Web Hypertext Application Technology Working Group steering committee.",
   'is_supporting': False},
  {'idx': 14,
   'title': 'Ellery Queen (TV series)',
   'paragraph_text': 'Ellery Queen is an American TV series, developed by Richard Levinson and William Link, who based it on the fictional character of the same name. The series ran on NBC from September 11, 1975, to April 4, 1976 featuring the titular fictional sleuth. The series stars Jim Hutton as the titular character, and David Wayne as his father, Inspector Richard Queen.',
   'is_supporting': False},
  {'idx': 15,
   'title': 'Hunting',
   'paragraph_text': 'In the 19th century, southern and central European sport hunters often pursued game only for a trophy, usually the head or pelt of an animal, which was then displayed as a sign of prowess. The rest of the animal was typically discarded. Some cultures, however, disapprove of such waste. In Nordic countries, hunting for trophies was—and still is—frowned upon. Hunting in North America in the 19th century was done primarily as a way to supplement food supplies, although it is now undertaken mainly for sport.[citation needed] The safari method of hunting was a development of sport hunting that saw elaborate travel in Africa, India and other places in pursuit of trophies. In modern times, trophy hunting persists and is a significant industry in some areas.[citation needed]',
   'is_supporting': False},
  {'idx': 16,
   'title': 'Safari School',
   'paragraph_text': 'Safari School is a BBC Two reality television series presented by Dr Charlotte Uhlenbroek in which eight celebrities take part in a four-week ranger training course in the Shamwari Game Reserve in South Africa.',
   'is_supporting': False},
  {'idx': 17,
   'title': 'African Safari Wildlife Park',
   'paragraph_text': 'The African Safari Wildlife Park is a drive through wildlife park in Port Clinton, Ohio, United States. Visitors can drive through the preserve and watch and feed the animals from their car. Visitors can spend as much time in the preserve as they wish, observing and feeding the animals, before proceeding to the walk through part of the park, called Safari Junction. The park is closed during the winter.',
   'is_supporting': False},
  {'idx': 18,
   'title': 'White armored car',
   'paragraph_text': 'The White armored car was a series of armored cars developed by the White Motor Company in Cleveland, Ohio from 1915.',
   'is_supporting': False},
  {'idx': 19,
   'title': 'Blue Tea Games',
   'paragraph_text': 'The 14th game of this series. The BETA game was released in September 2017. This episode will be developed by Blue Tea Games who return to the series since 2014.',
   'is_supporting': False}],
 'question': "Who developed the eponymous character from the series that contains Mickey's Safari in Letterland?",
 'question_decomposition': [{'id': 153573,
   'question': "What series is Mickey's Safari in Letterland from?",
   'answer': 'Mickey Mouse',
   'paragraph_support_idx': None},
  {'id': 109006,
   'question': 'Who developed #1 ?',
   'answer': 'Walt Disney',
   'paragraph_support_idx': None}],
 'answer': 'Walt Disney',
 'answer_aliases': [],
 'answerable': False}

This is but one entry in the jsonl file. Although it has an answer 'answer': 'Walt Disney', there is not enough supporting evidence within the 20 accompanying paragraphs to substantiate it, and so it has a label of 'answerable': False. Each paragraph has an is_supporting label that is used to evaluate the pipeline’s ability to not only use the information found in these paragraphs, but to also classify them as supporting elements.

Furthermore, here are a couple of gold paragraphs from another question, where is_supporting == True. Here you can witness for yourself the necessary connection between Lloyd Dane and the county of his birthplace; just one of the paragraphs by itself wouldn’t be enough to make that connection:

Code
from IPython.display import display
from pprint import pprint

# Show the question for the last entry we loaded
display(js_list[-1]['question'])

# Keep only the 'gold' paragraphs marked as supporting
p_list = []
for paragraph in js_list[-1]['paragraphs']:
    if paragraph.get('is_supporting', False):
        p_list.append(paragraph)
pprint(p_list)
"Which county does Lloyd Dane's birthplace belong to?"
[{'idx': 3,
  'is_supporting': True,
  'paragraph_text': 'Lloyd Dane (August 19, 1925 – December 11, 2015) was a '
                    'NASCAR Grand National Series driver from Eldon, Missouri. '
                    'He participated part-time in the 1951 and 1954 to 1964 '
                    'seasons, capturing four wins, all in his own car. Two of '
                    "Dane's wins came during the 1956 season, when he finished "
                    'a career best 23rd in points.',
  'title': 'Lloyd Dane'},
 {'idx': 11,
  'is_supporting': True,
  'paragraph_text': 'Eldon is a city in Miller County, Missouri, United '
                    'States, located thirty miles southwest of Jefferson City. '
                    'The population was 4,567 at the 2010 census.',
  'title': 'Eldon, Missouri'}]

3 Finale(more of a cliffhanger)

That’s it for this initial explanatory and exploratory chapter. In the next post, we’ll dive into constructing knowledge graphs from the provided paragraphs used to answer the questions or deem them unanswerable.

Onwards to part one >>