Don’t RAG on Knowledge Graphs (Or Do) Benchmarking: Theory behind using an LLM to Build Knowledge Graphs – Part One

Going into the nitty gritty (theoretical) details and nuances of building knowledge graphs with Large Language Models
knowledge-graphs
rag
benchmarking
Author

Dmitriy Leybel

Published

April 12, 2024

Abstract
This post introduces you to the dataset and prediction format we will be using, the tokenization of the questions, the process of choosing a model, and the high-level tasks necessary to build a knowledge graph from text.

On the last episode of: Don’t RAG on Knowledge Graphs (Or Do): RAG, Knowledge Graphs, and Benchmarks – Part Zero


1 What are we predicting?

First and foremost, if we want to build a knowledge graph to assist us with a certain task, we want to ascertain exactly what the output at the end of this pipeline should look like.

1.1 Data and Prediction Datasets

By ‘predictions’, we mean ‘the answers’ and any other expected outputs. It’s a vestige of machine learning vernacular.

Fortunately for us, the fine folks who created the MuSiQue benchmark have made it simple. They’ve run several models on the dataset and used them for evaluations. We can find the generated predictions in their GitHub repo; this gives us the starting point we need. Let’s first look at an example of the input, provided in the data folder (again, from the repo). Note that there is a question, an answer, an answerable flag, and a bunch of paragraphs, each marked as to whether it supports the answer.

Code
import jsonlines
from pprint import pprint
from IPython.display import Markdown, display

# musique_dir is assumed to point at a local clone of the MuSiQue repo.
with jsonlines.open(musique_dir + '/data/musique_full_v1.0_train.jsonl') as reader:
    lines = [reader.read() for _ in range(1000)]
display(Markdown('**Line Example**'))
pprint(lines[1])
{'answer': 'north',
 'answer_aliases': ['North', 'N'],
 'answerable': True,
 'id': '2hop__269805_135710',
 'paragraphs': [{'idx': 0,
                 'is_supporting': False,
                 'paragraph_text': 'Milton F. Pavlic (1909–1942) was a United '
                                   'States Navy officer killed in action '
                                   'during World War II for whom a U.S. Navy '
                                   'high-speed transport was named.',
                 'title': 'Milton F. Pavlic'},
                {'idx': 1,
                 'is_supporting': False,
                 'paragraph_text': 'Osmund Holm-Hansen (also known as Oz '
                                   'Holm-Hansen) is a Norwegian-born American '
                                   'scientist, for whom Mount Holm-Hansen, in '
                                   'Antarctica is named. A plant physiologist '
                                   'by training, from 1962 Holm-Hansen was the '
                                   'head of polar research at the Scripps '
                                   'Institution of Oceanography.',
                 'title': 'Osmund Holm-Hansen'},
                {'idx': 2,
                 'is_supporting': False,
                 'paragraph_text': '"Sapphire Princess" was built in Japan by '
                                   'Mitsubishi Heavy Industries, the second '
                                   'Princess Cruises ship to be built in a '
                                   'Japanese shipyard. Her only sister ship is '
                                   '"Diamond Princess", with whom she swapped '
                                   'names during construction.',
                 'title': 'Sapphire Princess'},
                {'idx': 3,
                 'is_supporting': False,
                 'paragraph_text': 'Lake Pontchartrain is named for Louis '
                                   'Phélypeaux, comte de Pontchartrain. He was '
                                   'the French Minister of the Marine, '
                                   'Chancellor, and Controller-General of '
                                   "Finances during the reign of France's "
                                   '"Sun King", Louis XIV, for whom the colony '
                                   'of "La Louisiane" was named.',
                 'title': 'Lake Pontchartrain'},
                {'idx': 4,
                 'is_supporting': False,
                 'paragraph_text': 'Henry Crater is a large crater in the '
                                   'Arabia quadrangle of Mars, located at '
                                   '10.9° north latitude and 23.3° east '
                                   'longitude. It is in diameter and was named '
                                   'after the brothers Paul Henry and Prosper '
                                   'Henry, both of whom were French telescope '
                                   'makers and astronomers.',
                 'title': 'Henry (Martian crater)'},
                {'idx': 5,
                 'is_supporting': False,
                 'paragraph_text': 'Where Dead Voices Gather is a book by Nick '
                                   'Tosches. It is, in part, a biography of '
                                   'Emmett Miller, one of the last minstrel '
                                   'singers. Just as importantly, it depicts '
                                   "Tosches' search for information about "
                                   'Miller, about whom he initially wrote in '
                                   'his book "Country: The Twisted Roots of '
                                   'Rock and Roll". It is also a study of '
                                   'minstrelsy and its connection to American '
                                   'folk music, country music, the blues and '
                                   'ultimately, rock and roll. In that way, it '
                                   'is a companion volume to his other books '
                                   'of music journalism, "Country" and "Unsung '
                                   'Heroes of Rock N\' Roll".',
                 'title': 'Where Dead Voices Gather'},
                {'idx': 6,
                 'is_supporting': True,
                 'paragraph_text': 'Norway has a total area of and a '
                                   'population of 5,312,300 (as of August '
                                   '2018). The country shares a long eastern '
                                   'border with Sweden (1,619 km or 1,006\xa0'
                                   'mi long). Norway is bordered by Finland '
                                   'and Russia to the north-east, and the '
                                   'Skagerrak strait to the south, with '
                                   'Denmark on the other side. Norway has an '
                                   'extensive coastline, facing the North '
                                   'Atlantic Ocean and the Barents Sea. The '
                                   "maritime influence also dominates Norway's "
                                   'climate with mild lowland temperatures on '
                                   'the sea coasts, whereas the interior, '
                                   'while colder, also is a lot milder than '
                                   'areas elsewhere in the world on such '
                                   'northerly latitudes. Even during polar '
                                   'night in the north, temperatures above '
                                   'freezing are commonplace on the coastline. '
                                   'The maritime influence brings high '
                                   'rainfall and snowfall to some areas of the '
                                   'country.',
                 'title': 'Norway'},
                {'idx': 7,
                 'is_supporting': False,
                 'paragraph_text': 'The Hireling Shepherd (1851) is a painting '
                                   'by the Pre-Raphaelite artist William '
                                   'Holman Hunt. It represents a shepherd '
                                   'neglecting his flock in favour of an '
                                   'attractive country girl to whom he shows a '
                                   "death's-head hawkmoth. The meaning of the "
                                   'image has been much debated.',
                 'title': 'The Hireling Shepherd'},
                {'idx': 8,
                 'is_supporting': False,
                 'paragraph_text': 'Naissa Mosque is a mosque in Qardaha, '
                                   'along the Syrian coast. It was built in '
                                   '1989 by architect Abdul Rahman Naassan, '
                                   'and funded by the mother of former '
                                   'president Hafez al-Assad, Naissa '
                                   'Assad—after whom the mosque was named. The '
                                   'state funeral of Hafez al-Assad was '
                                   'observed at the mosque.',
                 'title': 'Naissa Mosque'},
                {'idx': 9,
                 'is_supporting': False,
                 'paragraph_text': 'The quick German victory over the French '
                                   'stunned neutral observers, many of whom '
                                   'had expected a French victory and most of '
                                   'whom had expected a long war. The '
                                   'strategic advantages possessed by the '
                                   'Germans were not appreciated outside '
                                   'Germany until after hostilities had '
                                   'ceased. Other countries quickly discerned '
                                   'the advantages given to the Germans by '
                                   'their military system, and adopted many of '
                                   'their innovations, particularly the '
                                   'General Staff, universal conscription and '
                                   'highly detailed mobilization systems.',
                 'title': 'Franco-Prussian War'},
                {'idx': 10,
                 'is_supporting': True,
                 'paragraph_text': 'Tveitsund is a village in Nissedal '
                                   'municipality, Norway. The urban area '
                                   'Tveitsund, which consists of Tveitsund and '
                                   'Treungen, has a population of 361.',
                 'title': 'Tveitsund'},
                {'idx': 11,
                 'is_supporting': False,
                 'paragraph_text': 'John Francis Sheehan (1910–1942) was a '
                                   'United States Navy sailor killed in action '
                                   'during World War II for whom a destroyer '
                                   'escort was named during the war.',
                 'title': 'John Francis Sheehan'},
                {'idx': 12,
                 'is_supporting': False,
                 'paragraph_text': 'Holmes Summit is a peak rising to , the '
                                   'highest elevation in the Read Mountains of '
                                   'the Shackleton Range in Antarctica. It was '
                                   'photographed from the air by the U.S. Navy '
                                   'in 1967 and was surveyed by the British '
                                   'Antarctic Survey in the period 1968–71. In '
                                   'association with the names of geologists '
                                   'grouped in this area, it was named by the '
                                   'UK Antarctic Place-Names Committee in 1971 '
                                   'after Professor Arthur Holmes, after whom '
                                   'the Holmes Hills in Palmer Land were also '
                                   'named.',
                 'title': 'Holmes Summit'},
                {'idx': 13,
                 'is_supporting': False,
                 'paragraph_text': ', better known by her pen name is a '
                                   'Japanese manga artist. She is married to '
                                   'fellow manga artist Tatsuneko, from whom '
                                   'he took the name of . She is a graduate of '
                                   'Mita Senior High School, Tokyo. She '
                                   'currently lives in Setagaya, Tokyo with '
                                   'her husband and daughter.',
                 'title': 'Yun Kōga'},
                {'idx': 14,
                 'is_supporting': False,
                 'paragraph_text': 'The Book of Proper Names () is a Belgian '
                                   'novel by Amélie Nothomb. It was first '
                                   'published in 2002. It is a romanticized '
                                   'account of the life of the singer RoBERT, '
                                   'whom Nothomb became acquainted with as an '
                                   'avid admirer of her songs.',
                 'title': 'The Book of Proper Names'},
                {'idx': 15,
                 'is_supporting': False,
                 'paragraph_text': '653 Berenike is a main-belt asteroid '
                                   'discovered on November 27, 1907, by Joel '
                                   'Hastings Metcalf at Taunton, '
                                   'Massachusetts. It is named after Berenice '
                                   'II of Egypt, after whom the constellation '
                                   'Coma Berenices is also named.',
                 'title': '653 Berenike'},
                {'idx': 16,
                 'is_supporting': False,
                 'paragraph_text': 'orbiting the Sun. It was discovered on 21 '
                                   'February 1906 by August Kopff from '
                                   'Heidelberg. Kopff named the asteroid after '
                                   'a female English student with whom he was '
                                   'acquainted.',
                 'title': '596 Scheila'},
                {'idx': 17,
                 'is_supporting': False,
                 'paragraph_text': 'William M. Hobby (1899–1942), was a United '
                                   'States Navy officer killed in action '
                                   'during World War II for whom a U.S. Navy '
                                   'ship was named.',
                 'title': 'William M. Hobby'},
                {'idx': 18,
                 'is_supporting': False,
                 'paragraph_text': 'The Alma Grace McDonough Health and '
                                   'Recreation Center is a 2,200 seat '
                                   'multipurpose arena and recreation facility '
                                   'on the campus of Wheeling Jesuit '
                                   'University in Wheeling, West Virginia. The '
                                   'building was constructed thanks to a gift '
                                   'from Alma Grace McDonough, whom the '
                                   'building is named after.',
                 'title': 'Alma Grace McDonough Health and Recreation Center'},
                {'idx': 19,
                 'is_supporting': False,
                 'paragraph_text': 'Émile Bertrand (1844–1909) was a French '
                                   'mineralogist, in honour of whom '
                                   'bertrandite was named by Alexis Damour. He '
                                   'also gave his name to the "Bertrand lens" '
                                   'or phase telescope.',
                 'title': 'Émile Bertrand'}],
 'question': 'What is the country where Nissedal is located named after?',
 'question_decomposition': [{'answer': 'Norway',
                             'id': 269805,
                             'paragraph_support_idx': 10,
                             'question': 'Nissedal >> country'},
                            {'answer': 'north',
                             'id': 135710,
                             'paragraph_support_idx': 6,
                             'question': 'The #1 was named for whom?'}]}

Line Example


Looking at a snippet of the predictions below, we see that only four fields are necessary: the id, which matches the question id; the predicted answer; the answerable flag, a boolean indicating whether the question can be answered; and the supporting paragraph indices, which point to the paragraphs that support the answer.

Code
display(Markdown('**Examples of predictions**'))
with jsonlines.open(musique_dir + '/predictions/musique_ans_v1.0_dev_end2end_model_predictions.jsonl', 'r') as file:
    for i in range(5):
        pprint(file.read())

Examples of predictions

{'id': '2hop__460946_294723',
 'predicted_answer': 'Jennifer Garner',
 'predicted_answerable': True,
 'predicted_support_idxs': [0, 10]}
{'id': '2hop__252311_366220',
 'predicted_answer': 'Steven Spielberg',
 'predicted_answerable': True,
 'predicted_support_idxs': [10, 18]}
{'id': '2hop__701895_752697',
 'predicted_answer': 'Cypriot part was merged into the Bank of Cyprus '
                     '(including insured deposits under 100,000 Euro) and the '
                     "'bad' part or legacy entity holds all the overseas "
                     'operations as well as uninsured deposits above 100,000 '
                     'Euro, old shares and bonds. The uninsured depositors '
                     'were subject to a bail-in and became the new '
                     'shareholders of the legacy entity. As at May 2017, the '
                     'legacy entity is one of the largest shareholders of Bank '
                     'of Cyprus with 4.8% but does not hold a board seat. All '
                     'the overseas operations, of the now defunct Cyprus '
                     'Popular Bank, are also held by the legacy entity, until '
                     'they are sold by the Special Administrator, at first Ms '
                     'Andri Antoniadou, who ran the legacy entity for two '
                     'years, from March 2013 until 3 March 2015. She tendered '
                     'her resignation due to disagreements, with the Governor '
                     'of the Central Bank of Cyprus and the Central Bank Board '
                     'members, who amended the lawyers of the legacy entity, '
                     'without consulting her. Veteran banker Chris [[PP]] The '
                     'Ciudad Deportiva ("Sports City") is a sports complex in '
                     'Nuevo Laredo, Mexico. It is home to the Tecolotes de '
                     'Nuevo Laredo Mexican Baseball League team and the Toros '
                     'de Nuevo Laredo Mexican professional basketball team '
                     'from the Liga Nacional de Baloncesto Profesional. The '
                     "Ciudad Deportiva's Estadio Nuevo Laredo (baseball park) "
                     'can seat up to 12,000 fans at a baseball game and the '
                     'Nuevo Laredo Multidisciplinary Gymnasium can seat 4,000 '
                     'fans at a basketball game. [[PP]] Juan Carlos Espinoza '
                     'Mercado (born 23 July 1987 in Machala) is an Ecuadorian '
                     'professional football player who has played for '
                     'Ecuadorian club Liga Deportiva Universitaria de Loja and '
                     'in 2010 he joined Peruvian club Juan Aurich. [[PP]] '
                     'Estadio Unión Tarma is a multi-use stadium in Tarma, '
                     'Peru. It is currently used mostly for football matches '
                     'and is the home stadium of Asociación Deportiva Tarma of '
                     'the Copa Perú. The stadium holds 9,000 spectators. '
                     '[[PP]] A Nigerian State is a federated political entity, '
                     'which shares sovereignty with the Federal Government of '
                     'Nigeria, There are 36 States in Nigeria, which are bound '
                     'together by a federal agreement. There is also a '
                     'territory called the Federal Capital Territory (FCT), '
                     'which is not a state, but a territory, under the direct '
                     'control of the Federal Government. The States are '
                     'further divided into a total of 774 Local Government '
                     'Areas. Under the Nigerian Constitution, states have the '
                     'power to ratify constitutional amendments. [[PP]] Ofu '
                     'Airport is a public airport located one mile (2 km) '
                     'southeast of the village of Ofu on the island of Ofu in '
                     'American Samoa, an unincorporated territory of the '
                     'United States. This airport is publicly owned by '
                     'Government of American Samoa. [[PP]] The Díaz '
                     'administration made political decisions and took legal '
                     'measures that allowed the elite throughout Mexico',
 'predicted_answerable': True,
 'predicted_support_idxs': [11, 16, 18]}
{'id': '2hop__259228_793698',
 'predicted_answer': 'Fairfield, Connecticut. Its main offices are located at '
                     '30 Rockefeller Plaza at Rockefeller Center in New York '
                     'City, known now as the Comcast Building. It was formerly '
                     'known as the GE Building for the prominent GE logo on '
                     "the roof; NBC's headquarters and main studios are also "
                     'located in the building. Through its RCA subsidiary, it '
                     'has been associated with the center since its '
                     'construction in the 1930s. GE moved its corporate '
                     'headquarters from the GE Building on Lexington Avenue to '
                     'Fairfield in 1974. [[PP]] The lander is named after the '
                     'Philae obelisk, which bears a bilingual inscription and '
                     'was used along with the Rosetta Stone to decipher '
                     'Egyptian hieroglyphs. "Philae" was monitored and '
                     "operated from DLR's Lander Control Center in Cologne",
 'predicted_answerable': True,
 'predicted_support_idxs': [2, 10, 14]}
{'id': '2hop__481349_302087',
 'predicted_answer': 'Bombardier Inc. the former CRJ100 and CRJ200 series are '
                     'no longer in production but still in active airline '
                     'service, while the more recent CRJ700, CRJ900 and '
                     'CRJ1000 series are in production and in service. [[PP]] '
                     'Products offered through the Great Value brand are often '
                     'claimed to be as good as national brand offerings, but '
                     'are typically sold at a lower price because of lower '
                     'marketing and advertising expense. As a house or store '
                     'brand, the Great Value line does not consist of goods '
                     'produced by Walmart, but is a labeling system for items '
                     'manufactured and packaged by a number of agricultural '
                     'and food corporations, such as ConAgra, Sara Lee which, '
                     'in addition to releasing products under its own brands '
                     'and exclusively for Walmart, also manufactures and '
                     'brands foods for a variety of other chain stores. Often, '
                     'this labeling system, to the dismay of consumers, does '
                     'not list location of manufacture of the product. Wal - '
                     'Mart contends that all Great Value products are produced '
                     'in the United States. Otherwise, the country of origin '
                     'would be listed. [[PP]] On June 11, 2006, the British '
                     'tabloid The Mail on Sunday reported that iPods are '
                     'mainly manufactured by workers who earn no more than '
                     'US$50 per month and work 15-hour shifts. Apple '
                     'investigated the case with independent auditors and '
                     "found that, while some of the plant's labour practices "
                     "met Apple's Code of Conduct, others did not: employees "
                     'worked over 60 hours a week for 35% of the time, and '
                     'worked more than six consecutive days for 25% of the '
                     'time. [[PP]] The EMD E6 was a , A1A-A1A, passenger train '
                     'locomotive manufactured by Electro-Motive Corporation, '
                     'and its corporate successor, General Motors',
 'predicted_answerable': True,
 'predicted_support_idxs': [5, 10]}

1.2 Inputs and Outputs

In essence, our pipeline primarily needs to take the question and the paragraphs and spit out:

  1. Whether the question is answerable.

  2. Which paragraphs contribute to the question’s answer.

  3. The answer.

Now we’re beginning to see why this is a difficult task. Nevertheless – onwards.
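
To make the target concrete, here is a minimal sketch of the record our pipeline will ultimately need to emit for each question, mirroring the predictions format shown above (the field values are copied by hand from the example question, not produced by any pipeline).

Code
from typing import TypedDict

class Prediction(TypedDict):
    id: str                            # matches the question id from the dataset
    predicted_answer: str              # the answer string
    predicted_answerable: bool         # whether the question can be answered
    predicted_support_idxs: list[int]  # indices of the supporting paragraphs

# Filled in by hand from the example above, just to show the shape.
example_prediction: Prediction = {
    'id': '2hop__269805_135710',
    'predicted_answer': 'north',
    'predicted_answerable': True,
    'predicted_support_idxs': [6, 10],
}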

2 Tokens and Tokens and More Tokens

These are the values as of this writing. They may change in the future.

There has been a lot of hype around enormous input context windows, which has led to articles such as RAG is dead, long live RAG. When we talk about these huge context windows, we’re primarily referring to the input, not the output. Google Gemini 1.5 has a 1-million-token context window, yet its allowed output is only 8,192 tokens. Similarly, OpenAI’s GPT-4 models have a 128k context window and only a 4,096-token output.

Figure 1: Context Sizes of GPT-4 and Gemini 1.5 and their max output sizes

Initially, large context windows were untenable as they ate resources like fat kids eat cake – they were also unreliable: the model would remember the beginning and end of the input while generally ‘forgetting’ the middle. This has improved over time, to the point of near-perfect performance with these enormous context windows. From the Gemini 1.5 whitepaper, we see that their needle-in-a-haystack (NiaH) performance is stellar: the model is able to locate key phrases and words within huge contexts. They use a 10-million-token context window for their stress testing.

Figure 2: Gemini 1.5 Needle in a Haystack Performance (It is a multimodal model, so it is able to take audio and video as inputs as well)

While very impressive, many argue that NiaH is purely a vanity metric, and that in order to truly test the context window you need real-world evaluations and the ability to test reasoning across this mass of data.

For shits and giggles, we’ll see how many tokens we’re working with here.

But first…

2.0.1 Tokens?

What the heck is a token anyways? Please skip this section if you’re a token master – or don’t if you fancy my prose, up to you.

I’m not going to describe byte-pair encoding (BPE) at length, but I will try to prime your intuition a bit. All current performant foundation models use BPE for their model inputs, so this should be relevant for maybe another, y’know, three hours (I jest). OpenAI offers a fun little token visualizer tool.

Figure 3: OpenAI Tokenization Example

Essentially, the token vocabulary is determined by feeding a large corpus of text into an algorithm that starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pairs into new tokens, iterating until the target vocabulary size is reached. If we look at Fig. 3, we see that ***** is a single token, while [ is also a single token with its own unique numerical designation within the LLM. Some sequences of characters are common enough that it makes sense to treat them as one token. Also, notice that the preceding spaces around the words are treated as part of the word token. Smiley faces are common enough that they have earned their own token (at least that’s my interpretation of it). You can also see that token strings can be part of larger token strings, as we see between ** and *****; both are completely distinct tokens to the model.

When you feed a string into the model, it is split into these tokens, each of which is matched to its unique numerical ID (ultimately just a binary number, e.g. 1010101111000), and those IDs are what go into the model.
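
To prime that intuition a bit further, here is a toy sketch of the merge loop at the heart of BPE. It is nowhere near the production algorithm – real tokenizers work on bytes, apply pre-tokenization rules, and train on enormous corpora – but it shows how frequent adjacent pairs get promoted into single tokens.

Code
from collections import Counter

def toy_bpe_merges(corpus: str, num_merges: int = 10) -> list[str]:
    """Toy BPE sketch: start from single characters and repeatedly merge the
    most frequent adjacent pair of tokens into a new, longer token."""
    tokens = list(corpus)
    learned = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats anymore, so stop merging
        learned.append(left + right)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # replace the pair with the merged token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return learned

# Frequent character sequences (spaces included) get merged early, which is
# why common substrings end up as single tokens in real vocabularies.
print(toy_bpe_merges('the cat sat on the mat, the cat sat on the hat'))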

2.0.2 Token Measurement

Different models use different tokenization strategies (but the same underlying technique) trained on varying datasets, so we’ll focus on the publicly available algorithms. tiktoken is an OpenAI library you can use to determine the token representation of any string of text.

Code
import tiktoken

display(Markdown('**Token models**'), tiktoken.model.MODEL_TO_ENCODING)
tokenizer = tiktoken.encoding_for_model('gpt-3.5-turbo')
display(Markdown('**Tokenizer we are using**'), tokenizer)

Token models

{'gpt-4': 'cl100k_base',
 'gpt-3.5-turbo': 'cl100k_base',
 'gpt-3.5': 'cl100k_base',
 'gpt-35-turbo': 'cl100k_base',
 'davinci-002': 'cl100k_base',
 'babbage-002': 'cl100k_base',
 'text-embedding-ada-002': 'cl100k_base',
 'text-embedding-3-small': 'cl100k_base',
 'text-embedding-3-large': 'cl100k_base',
 'text-davinci-003': 'p50k_base',
 'text-davinci-002': 'p50k_base',
 'text-davinci-001': 'r50k_base',
 'text-curie-001': 'r50k_base',
 'text-babbage-001': 'r50k_base',
 'text-ada-001': 'r50k_base',
 'davinci': 'r50k_base',
 'curie': 'r50k_base',
 'babbage': 'r50k_base',
 'ada': 'r50k_base',
 'code-davinci-002': 'p50k_base',
 'code-davinci-001': 'p50k_base',
 'code-cushman-002': 'p50k_base',
 'code-cushman-001': 'p50k_base',
 'davinci-codex': 'p50k_base',
 'cushman-codex': 'p50k_base',
 'text-davinci-edit-001': 'p50k_edit',
 'code-davinci-edit-001': 'p50k_edit',
 'text-similarity-davinci-001': 'r50k_base',
 'text-similarity-curie-001': 'r50k_base',
 'text-similarity-babbage-001': 'r50k_base',
 'text-similarity-ada-001': 'r50k_base',
 'text-search-davinci-doc-001': 'r50k_base',
 'text-search-curie-doc-001': 'r50k_base',
 'text-search-babbage-doc-001': 'r50k_base',
 'text-search-ada-doc-001': 'r50k_base',
 'code-search-babbage-code-001': 'r50k_base',
 'code-search-ada-code-001': 'r50k_base',
 'gpt2': 'gpt2',
 'gpt-2': 'gpt2'}

Tokenizer we are using

<Encoding 'cl100k_base'>

We observe that the latest models use the cl100k_base encoding, which we can assume has roughly 100,000 unique tokens; prior to this, 50,000-token encodings were used. We also instantiate our tokenizer for the next step. Choosing the gpt-4 or gpt-3.5-turbo tokenizer makes no material difference, as they use the exact same encoding.

We can run the tokenizer on one of our paragraphs to illustrate its token composition.

Code
test_line = lines[1]
test_paragraphs = test_line['paragraphs']

display(Markdown('**Paragraph Example**'), test_paragraphs[0])
test_tokens = tokenizer.encode(test_paragraphs[0]['paragraph_text'])
display(Markdown('**Tokens**'), test_tokens)
display(Markdown('**Number of Tokens**'), len(test_tokens))

Paragraph Example

{'idx': 0,
 'title': 'Milton F. Pavlic',
 'paragraph_text': 'Milton F. Pavlic (1909–1942) was a United States Navy officer killed in action during World War II for whom a U.S. Navy high-speed transport was named.',
 'is_supporting': False}

Tokens

[44,
 16695,
 435,
 13,
 43856,
 416,
 320,
 7028,
 24,
 4235,
 6393,
 17,
 8,
 574,
 264,
 3723,
 4273,
 19574,
 9640,
 7577,
 304,
 1957,
 2391,
 4435,
 5111,
 8105,
 369,
 8884,
 264,
 549,
 815,
 13,
 19574,
 1579,
 30699,
 7710,
 574,
 7086,
 13]

Number of Tokens

39

Only 39 tokens – nice.

Is this something we can expect from the provided paragraphs in our dataset?

Code
for paragraph in test_paragraphs:
    paragraph_text = paragraph['paragraph_text']
    paragraph_tokens = tokenizer.encode(paragraph_text)
    print(f'Number of tokens in paragraph: {len(paragraph_tokens)}')
Number of tokens in paragraph: 39
Number of tokens in paragraph: 68
Number of tokens in paragraph: 45
Number of tokens in paragraph: 64
Number of tokens in paragraph: 59
Number of tokens in paragraph: 131
Number of tokens in paragraph: 176
Number of tokens in paragraph: 59
Number of tokens in paragraph: 71
Number of tokens in paragraph: 86
Number of tokens in paragraph: 42
Number of tokens in paragraph: 36
Number of tokens in paragraph: 102
Number of tokens in paragraph: 61
Number of tokens in paragraph: 58
Number of tokens in paragraph: 57
Number of tokens in paragraph: 39
Number of tokens in paragraph: 35
Number of tokens in paragraph: 59
Number of tokens in paragraph: 48

Not exactly, but the longest paragraph is only about 176 tokens, so we’re still dealing with fairly small token counts.
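
Out of curiosity, we can also total up the tokens for one full question – all twenty paragraphs plus the question text – which is where the head-math in the next section comes from.

Code
# Total context for a single question: every paragraph plus the question itself.
paragraph_token_counts = [len(tokenizer.encode(p['paragraph_text'])) for p in test_paragraphs]
question_token_count = len(tokenizer.encode(test_line['question']))

print(f'Total paragraph tokens: {sum(paragraph_token_counts)}')
print(f'Question tokens: {question_token_count}')
# For this example the paragraphs alone come to roughly 1,300 tokens, so a full
# question plus instructions comfortably sits in the low thousands.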

3 Motivation for RAG over Large Context Windows

If you’re thinking what I’m thinking, you’ve probably done the head-math and figured that roughly 2,000 tokens fit easily into a 1-million-token context window, with the only remaining task being some clever prompt engineering.

While this is true, we have to think about cost and scale, as well as veracity. RAG systems tend to be substantially cheaper than context stuffing. This entry by Atai Barkai illustrates the cost of RAG compared to context stuffing on a simple benchmark like the previously mentioned NiaH: context stuffing ends up being roughly 2500% more expensive. According to my calculations, which you can totally trust, that’s a lot of 🥑avocado toast.

On top of the cost benefit, once we include knowledge graphs we also gain symbolic, representational knowledge as a memory – something neither context stuffing nor vanilla RAG offers.

4 Choosing a Model

When selecting a model, we are often in the shoes of Goldilocks: we don’t want it to be too expensive, but we also don’t want it to lack critical performance where it matters – we usually want that golden middle ground. To hit it, combinations of models are typically used: a GPT-4-level model for the abstract, high-level thinking, and lower GPT-3.5-level models for simpler processes that don’t require very high levels of abstraction.

4.1 Jean-Claude Van Damme, tell me a Haiku

What. Just kidding. We’ll be talking about Anthropic’s Claude 3 models. The following chart is from the LMSYS Chatbot Arena, where models go head to head answering user prompts and users vote on which response they prefer.

Figure 4: Comparison of performance and cost among top models

On the far right, we have GPT-4 and Claude 3 Opus neck and neck as the highest-performing models. As of this writing, the latest GPT-4 Turbo model has actually overtaken Claude 3 Opus. At the very top, we see Claude 3 Haiku, which performs slightly below one of the GPT-4 models, but at an incredibly low cost. All of the Claude 3 models have a 200,000-token context window and a 4,096-token output – comparable to GPT-4’s 128,000-token window and 4,096-token output. Claude 3 Haiku will be the model we’ll be using; if there are any hurdles with it, it will not be too difficult to pivot by simply changing the endpoint to GPT-3.5 or GPT-4.

Here is Claude 3 Haiku writing a haiku about itself:

Artificial mind,
Seeking to understand, learn,
Serve with empathy.

Are you impressed yet? Maybe a little scared?

Although Claude had a very large context window months before GPT-4, the jury is still out on whether it has been useful and robust enough for production.

4.1.1 Claude Tokenization

Unfortunately, Anthropic has not released a tokenizer that we can use; however, it is generally safe (famous last words, lol) to assume that it is quite similar to OpenAI’s. Here, someone has attempted to reverse engineer it by counting the tokens of generations as they are streamed back to you.

But we’re not going to do that.
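
What we can do is lean on the OpenAI tokenizer we already have as a rough proxy. This is purely an approximation – Anthropic’s encoding almost certainly differs – but it is good enough for ballpark context-budget estimates.

Code
def approx_claude_token_count(text: str) -> int:
    """Rough proxy only: count tokens with cl100k_base and treat the result as
    a ballpark estimate of Claude's token count, not an exact figure."""
    return len(tokenizer.encode(text))

# ~39 for our first sample paragraph, per the earlier count.
print(approx_claude_token_count(test_paragraphs[0]['paragraph_text']))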

5 Creating a Knowledge Graph

From the previous post, you may remember that we spoke of combining a vector store with a knowledge graph in order to take advantage of the complementary strengths of that combination. Because generating a workflow for knowledge graph creation is an undertaking in its own right, we’ll first build the knowledge graph, and then attach the logic for using it alongside a vector store. For descriptive purposes, this is much easier and less convoluted than trying to build both at once.

5.1 Strategy

5.1.1 Sliding Windows

To answer the questions asked in the MuSiQue benchmark, we will create a separate knowledge graph for every individual question, built from its twenty provided paragraphs. Each paragraph can be arbitrarily divided into multiple chunks of text which the LLM takes into its context as input.

Figure 5: Each question contains multiple paragraphs, and each paragraph is made out of multiple text chunks.

We can use a sliding window to process the chunks of text that the paragraphs are composed of. There are numerous ways to insert variable amounts of text into the context of an LLM, but I’ll introduce the two basic approaches: a sliding window that takes in one chunk of text after another, or a sliding window with some overlap between consecutive chunks. We’ll use the latter strategy, as it may help with the continuity of the model’s understanding of the text. As the window slides across the text, we want to generate the nodes and edges of the knowledge graph.

When I say ‘nodes and edges’, I also mean any attributes thereof
Figure 6: Sliding window with overlap tends to be the standard approach when inserting text into LLMs
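
Here is a minimal sketch of what an overlapping sliding window over tokens could look like; the chunk size and overlap below are illustrative values, not tuned choices.

Code
def chunk_with_overlap(text: str, tokenizer, chunk_size: int = 128, overlap: int = 32) -> list[str]:
    """Slide a window of `chunk_size` tokens across the text, stepping by
    `chunk_size - overlap` so consecutive chunks share some tokens."""
    token_ids = tokenizer.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(token_ids):
            break
    return chunks

# Example: split the longest paragraph from our sample question into overlapping chunks.
for chunk in chunk_with_overlap(test_paragraphs[6]['paragraph_text'], tokenizer, chunk_size=64, overlap=16):
    print(repr(chunk[:60]), '...')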

5.1.2 Knowledge-Stuffing

Connections are the bread and butter of knowledge graphs. If our LLM is producing nodes and edges only from our limited context window, it appears that we’re missing out on the connectivity benefit of knowledge graphs. To increase the connectivity of our knowledge graph, we can inform the LLM of the nodes and edges it has previously created by passing them into its context. Conveniently, this gives me the opportunity to introduce the composition of the context we’ll be using.

Figure 7: We provide the LLM with the system prompt, text chunks, and previously generated nodes and edges

Inside of the prompt we have our:

  • System prompt: Contains the necessary instructions for priming the model (e.g. “you are a brave and beautiful graph creator”), the formatting instructions for when we want JSON returned to us that represents the nodes and edges, and anything else we’ll need.

  • Previously generated nodes and edges: Knowing what has already been generated lets the model update existing nodes and edges or create new ones that may or may not relate to them.

  • Text chunks: The text from the paragraphs which the LLM will be converting to nodes and edges.

Unless we include all of the nodes and edges in the prompt, this still feels a bit limited. Technically, we could just shove all of those connections into the prompt – there’s ample space within our huge 200,000-token limit – but we want this method to generalize and scale to tasks outside of this particular dataset.
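
As a concrete illustration, here is one way that context could be assembled for a single call. Everything in it – the system prompt wording, the JSON shape of the graph elements, the build_messages helper – is a hypothetical placeholder, not the prompt we’ll actually ship.

Code
import json

# Hypothetical system prompt; the real one will be iterated on in the next post.
SYSTEM_PROMPT = (
    'You are a brave and beautiful graph creator. '
    'Read the text chunk and return JSON with two keys, "nodes" and "edges". '
    'Reuse or update the previously generated nodes and edges where appropriate.'
)

def build_messages(chunk: str, known_nodes: list[dict], known_edges: list[dict]) -> list[dict]:
    """Assemble the three context components: the system prompt, previously
    generated graph elements, and the current text chunk."""
    graph_context = json.dumps({'nodes': known_nodes, 'edges': known_edges}, indent=2)
    user_content = (
        f'Previously generated nodes and edges:\n{graph_context}\n\n'
        f'New text chunk:\n{chunk}'
    )
    return [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': user_content},
    ]

A chat-style message list like this maps readily onto both OpenAI’s and Anthropic’s APIs (Anthropic takes the system prompt as a separate parameter), which keeps the endpoint swap mentioned earlier cheap.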

5.1.3 Letting the LLM Loose

Consider the knowledge graph obtained after we process the 20 paragraphs pertaining to one question using the previously discussed method. We’d get something like:

Figure 8: Sparsely Connected Knowledge Graph

The facts we obtain from the text chunks will likely be connected in fairly atomic clusters, as there won’t be great continuity even when passing some of the previously computed nodes and edges into our context window. One way to fix this would be to feed random sets of nodes (and/or edges) to the LLM and let it generate new connections between them.

Figure 9: Building New Connections

This can be done in one of two ways (more, actually):

  1. Push the nodes and edges (and their attributes) into the context window and tell the model to blindly make associations based on that information alone.
  2. Along with the nodes and edges, push the segments of text that contributed to their creation. This gives the LLM more grounding and reduces hallucinations.

We’ll focus on the latter, as it pairs well with the vector store approach we will be discussing later.
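
Below is a sketch of that sampling step. The node schema it assumes – a name, optional attributes, and a source_chunk recording the text the node was extracted from – is a guess at how we’ll store things, not a fixed format.

Code
import random

def sample_nodes_for_linking(nodes: list[dict], k: int = 5) -> list[dict]:
    """Pick a random handful of nodes, keeping the text chunk each one was
    extracted from so the LLM stays grounded when proposing new edges."""
    picked = random.sample(nodes, min(k, len(nodes)))
    return [
        {
            'name': node['name'],
            'attributes': node.get('attributes', {}),
            'source_chunk': node['source_chunk'],
        }
        for node in picked
    ]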

6 Wrapping up

To be perfectly honest, I was intending to get into coding the knowledge graph creation pipeline in this post; I even had to change the title and abstract before publishing. Fortunately, there’s plenty here to mull over.

That’ll be happening in the next one – pinky promise. I’m hoping that this was a good amount of background and theory behind what we’ll be doing next.

You can reach out to me if you have any questions via X or email.

Part Deux >>