Don’t RAG on Knowledge Graphs (Or Do) Benchmarking: Combining Knowledge Graphs and Vector DBs to Answer Questions (With Sourcing) – Part Four

Building the answering segment of the QA system, centered on the knowledge graph and vector DB, now that both are populated.
knowledge-graphs
rag
benchmarking
vector-databases
Author

Dmitriy Leybel

Published

May 16, 2024

Abstract
This section focuses on constructing a workflow to answer questions found in the MuSiQue dataset. After a long and arduous road of constructing a knowledge graph, adding vector storage, and linking the two, we now have a system that can answer questions with a high degree of accuracy and efficiency (or so we hope).

On the last episode of: Don’t RAG on Knowledge Graphs (Or Do) Benchmarking: Adding a Vector Database – Part Three:


1 Overview

Below is a flow-charted summary of what this post will be focusing on. The vector database and knowledge graph are generated in previous posts.

Illustrated flow from question to structured answer.

2 Question

To be or not to be? 🫣

Our purpose here is to answer a question given a set of paragraphs, and to provide the supporting evidence for the answer.

As a brief reminder, let’s peek into a single entry of the MuSiQue dataset used in our previous exploration:

lines[-2]
{'id': '2hop__604134_131944',
 'paragraphs': [{'idx': 0,
   'title': 'Commonwealth of the Philippines',
   'paragraph_text': "The Commonwealth of the Philippines (; ) was the administrative body that governed the Philippines from 1935 to 1946, aside from a period of exile in the Second World War from 1942 to 1945 when Japan occupied the country. It replaced the Insular Government, a United States territorial government, and was established by the Tydings–McDuffie Act. The Commonwealth was designed as a transitional administration in preparation for the country's full achievement of independence.",
   'is_supporting': False},
  {'idx': 1,
   'title': 'Lake Oesa',
   'paragraph_text': 'Lake Oesa is a body of water located at an elevation of 2,267m (7438 ft) in the mountains of Yoho National Park, near Field, British Columbia, Canada.',
   'is_supporting': False},
  {'idx': 2,
   'title': 'Arafura Swamp',
   'paragraph_text': 'The Arafura Swamp is a large inland freshwater wetland in Arnhem Land, in the Top End of the Northern Territory of Australia. It is a near pristine floodplain with an area of that may expand to by the end of the wet season, making it the largest wooded swamp in the Northern Territory and, possibly, in Australia. It has a strong seasonal variation in depth of water. The area is of great cultural significance to the Yolngu people, in particular the Ramingining community. It was the filming location for the film "Ten Canoes".',
   'is_supporting': False},
  {'idx': 3,
   'title': 'Wapizagonke Lake',
   'paragraph_text': 'The Wapizagonke Lake is one of the bodies of water located the sector "Lac-Wapizagonke", in the city of Shawinigan, in the La Mauricie National Park, in the region of Mauricie, in Quebec, in Canada.',
   'is_supporting': False},
  {'idx': 4,
   'title': 'Khabarovsky District',
   'paragraph_text': 'Khabarovsky District () is an administrative and municipal district (raion), one of the seventeen in Khabarovsk Krai, Russia. It consists of two unconnected segments separated by the territory of Amursky District, which are located in the southwest of the krai. The area of the district is . Its administrative center is the city of Khabarovsk (which is not administratively a part of the district). Population:',
   'is_supporting': False},
  {'idx': 5,
   'title': 'Silver Lake (Harrisville, New Hampshire)',
   'paragraph_text': 'Silver Lake is a water body located in Cheshire County in southwestern New Hampshire, United States, in the towns of Harrisville and Nelson. Water from Silver Lake flows via Minnewawa Brook and The Branch to the Ashuelot River, a tributary of the Connecticut River.',
   'is_supporting': False},
  {'idx': 6,
   'title': 'Hyderabad',
   'paragraph_text': 'The jurisdictions of the city\'s administrative agencies are, in ascending order of size: the Hyderabad Police area, Hyderabad district, the GHMC area ("Hyderabad city") and the area under the Hyderabad Metropolitan Development Authority (HMDA). The HMDA is an apolitical urban planning agency that covers the GHMC and its suburbs, extending to 54 mandals in five districts encircling the city. It coordinates the development activities of GHMC and suburban municipalities and manages the administration of bodies such as the Hyderabad Metropolitan Water Supply and Sewerage Board (HMWSSB).',
   'is_supporting': False},
  {'idx': 7,
   'title': 'San Juan, Puerto Rico',
   'paragraph_text': "San Juan is located along the north - eastern coast of Puerto Rico. It lies south of the Atlantic Ocean; north of Caguas and Trujillo Alto; east of and Guaynabo; and west of Carolina. The city occupies an area of 76.93 square miles (199.2 km), of which, 29.11 square miles (75.4 km) (37.83%) is water. San Juan's main water bodies are San Juan Bay and two natural lagoons, the Condado and San José.",
   'is_supporting': False},
  {'idx': 8,
   'title': 'States of Germany',
   'paragraph_text': 'Local associations of a special kind are an amalgamation of one or more Landkreise with one or more Kreisfreie Städte to form a replacement of the aforementioned administrative entities at the district level. They are intended to implement simplification of administration at that level. Typically, a district-free city or town and its urban hinterland are grouped into such an association, or Kommunalverband besonderer Art. Such an organization requires the issuing of special laws by the governing state, since they are not covered by the normal administrative structure of the respective states.',
   'is_supporting': False},
  {'idx': 9,
   'title': 'Norfolk Island',
   'paragraph_text': "Norfolk Island is located in the South Pacific Ocean, east of the Australian mainland. Norfolk Island is the main island of the island group the territory encompasses and is located at 29°02′S 167°57′E\ufeff / \ufeff29.033°S 167.950°E\ufeff / -29.033; 167.950. It has an area of 34.6 square kilometres (13.4 sq mi), with no large-scale internal bodies of water and 32 km (20 mi) of coastline. The island's highest point is Mount Bates (319 metres (1,047 feet) above sea level), located in the northwest quadrant of the island. The majority of the terrain is suitable for farming and other agricultural uses. Phillip Island, the second largest island of the territory, is located at 29°07′S 167°57′E\ufeff / \ufeff29.117°S 167.950°E\ufeff / -29.117; 167.950, seven kilometres (4.3 miles) south of the main island.",
   'is_supporting': False},
  {'idx': 10,
   'title': 'Perm',
   'paragraph_text': 'Perm (;) is a city and the administrative centre of Perm Krai, Russia, located on the banks of the Kama River in the European part of Russia near the Ural Mountains.',
   'is_supporting': True},
  {'idx': 11,
   'title': 'Zvezda Stadium',
   'paragraph_text': 'Star (Zvezda) Stadium (), until 1991 Lenin Komsomol Stadium (), is a multi-use stadium in Perm, Russia. It is currently used mostly for football matches and is the home ground of FC Amkar Perm. The stadium holds 17,000 people and was opened on June 5, 1969.',
   'is_supporting': True},
  {'idx': 12,
   'title': 'Paea',
   'paragraph_text': 'Paea is a commune in the suburbs of Papeete in French Polynesia, an overseas territory of France in the southern Pacific Ocean. Paea is located on the island of Tahiti, in the administrative subdivision of the Windward Islands, themselves part of the Society Islands. At the 2017 census it had a population of 13,021.',
   'is_supporting': False},
  {'idx': 13,
   'title': 'Potamogeton amplifolius',
   'paragraph_text': 'Potamogeton amplifolius, commonly known as largeleaf pondweed or broad-leaved pondweed, is an aquatic plant of North America. It grows in water bodies such as lakes, ponds, and rivers, often in deep water.',
   'is_supporting': False},
  {'idx': 14,
   'title': 'Biysky District',
   'paragraph_text': "Biysky District () is an administrative and municipal district (raion), one of the fifty-nine in Altai Krai, Russia. It is located in the east of the krai and borders with Zonalny, Tselinny, Soltonsky, Krasnogorsky, Sovetsky, and Smolensky Districts, as well as with the territory of the City of Biysk. The area of the district is . Its administrative center is the city of Biysk (which is not administratively a part of the district). District's population:",
   'is_supporting': False},
  {'idx': 15,
   'title': 'Contoocook Lake',
   'paragraph_text': 'Contoocook Lake () is a water body located in Cheshire County in southwestern New Hampshire, United States, in the towns of Jaffrey and Rindge. The lake, along with Pool Pond, forms the headwaters of the Contoocook River, which flows north to the Merrimack River in Penacook, New Hampshire.',
   'is_supporting': False},
  {'idx': 16,
   'title': 'Bogotá',
   'paragraph_text': 'Bogotá (/ ˈboʊɡətɑː /, / ˌbɒɡəˈtɑː /, / ˌboʊ - /; Spanish pronunciation: (boɣoˈta) (listen)), officially Bogotá, Distrito Capital, abbreviated Bogotá, D.C., and formerly known as Santafé de Bogotá between 1991 and 2000, is the capital and largest city of Colombia, administered as the Capital District, although often thought of as part of Cundinamarca. Bogotá is a territorial entity of the first order, with the same administrative status as the departments of Colombia. It is the political, economic, administrative, industrial, artistic, cultural, and sports center of the country.',
   'is_supporting': False},
  {'idx': 17,
   'title': 'Body water',
   'paragraph_text': "Intracellular fluid (2 / 3 of body water) is fluid contained within cells. In a 72 - kg body containing 40 litres of fluid, about 25 litres is intracellular, which amounts to 62.5%. Jackson's texts states 70% of body fluid is intracellular.",
   'is_supporting': False},
  {'idx': 18,
   'title': 'Territorial waters',
   'paragraph_text': 'Territorial waters or a territorial sea, as defined by the 1982 United Nations Convention on the Law of the Sea, is a belt of coastal waters extending at most 12 nautical miles (22.2 km; 13.8 mi) from the baseline (usually the mean low - water mark) of a coastal state. The territorial sea is regarded as the sovereign territory of the state, although foreign ships (civilian) are allowed innocent passage through it, or transit passage for straits; this sovereignty also extends to the airspace over and seabed below. Adjustment of these boundaries is called, in international law, maritime delimitation.',
   'is_supporting': False},
  {'idx': 19,
   'title': 'Cyprus Popular Bank',
   'paragraph_text': "Cyprus Popular Bank (from 2006 to 2011 known as Marfin Popular Bank) was the second largest banking group in Cyprus behind the Bank of Cyprus until it was 'shuttered' in March 2013 and split into two parts. The 'good' Cypriot part was merged into the Bank of Cyprus (including insured deposits under 100,000 Euro) and the 'bad' part or legacy entity holds all the overseas operations as well as uninsured deposits above 100,000 Euro, old shares and bonds. The uninsured depositors were subject to a bail-in and became the new shareholders of the legacy entity. As at May 2017, the legacy entity is one of the largest shareholders of Bank of Cyprus with 4.8% but does not hold a board seat. All the overseas operations, of the now defunct Cyprus Popular Bank, are also held by the legacy entity, until they are sold by the Special Administrator, at first Ms Andri Antoniadou, who ran the legacy entity for two years, from March 2013 until 3 March 2015. She tendered her resignation due to disagreements, with the Governor of the Central Bank of Cyprus and the Central Bank Board members, who amended the lawyers of the legacy entity, without consulting her. Veteran banker Chris Pavlou who is an expert in Treasury and risk management took over as Special Administrator of the legacy entity in April 2015 until December 2016. The legacy entity is pursuing legal action against former major shareholder Marfin Investment Group.",
   'is_supporting': False}],
 'question': 'What is the body of water by the city where Zvezda stadium is located?',
 'question_decomposition': [{'id': 604134,
   'question': 'Zvezda >> located in the administrative territorial entity',
   'answer': 'Perm',
   'paragraph_support_idx': 11},
  {'id': 131944,
   'question': 'Which is the body of water by #1 ?',
   'answer': 'Kama River',
   'paragraph_support_idx': 10}],
 'answer': 'Kama River',
 'answer_aliases': ['Kama'],
 'answerable': True}

The question being:

What is the body of water by the city where Zvezda stadium is located?

Simple enough.

The format of the answer is also relevant to us:

{'id': '2hop__252311_366220',
 'predicted_answer': 'Steven Spielberg',
 'predicted_answerable': True,
 'predicted_support_idxs': [10, 18]}

This is taken straight from one of the prediction sets available in MuSiQue’s repo. Steven Spielberg is in fact not a body of water, but a movie director.

Our pipeline’s output needs to include:

  1. The answer.
  2. Whether the question is answerable given the supporting paragraphs.
  3. The paragraphs which contain the supporting information to answer the question.
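As a throwaway sketch (the helper name and argument layout are mine, not MuSiQue’s), packing those three outputs into the prediction format above might look like:

import json

def to_musique_prediction(entry_id, answer, answerable, support_idxs):
    # Mirrors the shape of the prediction-set entries from MuSiQue's repo.
    return json.dumps({
        'id': entry_id,
        'predicted_answer': answer,
        'predicted_answerable': answerable,
        'predicted_support_idxs': sorted(support_idxs),
    })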

3 Prompting

That’s right, we’re back to prompting, our bread n buttah.

This time, we’ll be feeding the question, instructions, and supporting evidence to the LLM. This will be very similar to how we coaxed the LLM into creating the knowledge graph in one of the previous posts.

3.1 Prompt Template

First, we need a system message that helps guide our model along, delivers an understanding of the input, and gently coerces it to output an aptly formatted answer.

import json
from typing import List

from langchain_core.messages import SystemMessage
from pydantic import BaseModel, Field

guidance_str = \
"You are the best taker of tests, particularly excelling at \
answering questions based on information provided to you. \
You will be given nodes and edges from a knowledge graph in \
a JSON format and you are expected to answer a question based \
on them. The 'from_node' and 'to_node' fields in the edges correspond \
to the 'connecting_id' fields in the nodes. \
Your output will only be JSON, and nothing more. \
No yapping.\n"

class Answer(BaseModel):
    answerable: bool = Field(..., description="true or false value. Whether or not the answer is answerable based on the provided nodes and edges")
    answer: str = Field(..., description="The answer to the question. Terse and concise.")
    support_idxs: List[int] = Field(..., description="The indices of the nodes that support the answer. From 'paragraph_idx' field")

format_str = f"This JSON Schema is the format you will be using: {json.dumps(Answer.model_json_schema())}"

system_message = SystemMessage(guidance_str + format_str)

The guidance_str lays out what I described above. We also provide the format_str, which includes a JSON dump of the Answer class schema, brought to you by Pydantic, although this time it’s a bit less convoluted than the one used to create the nodes and edges of our knowledge graph.

In addition to the System message, we also need to add the Human message template to our pipeline which will allow us to pass in the question and evidence.

from langchain_core.prompts import HumanMessagePromptTemplate

human_str = "Question: {question}\n\n Supporting Evidence:\n {evidence}"
human_template = HumanMessagePromptTemplate.from_template(human_str)
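Just to see the template in action with throwaway values (purely illustrative):

msg = human_template.format(question="What is 2 + 2?", evidence="[]")
print(msg.content)
# Question: What is 2 + 2?
#
#  Supporting Evidence:
#  []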

4 Gathering Evidence

Next up, we need to gather the supporting evidence for our model from our knowledge base (the combination of the knowledge graph and vector store that we created in previous posts).

top_results = collection.query(
    query_texts=[lines[-2]['question']]  # n_results defaults to 10
    )
top_results
{'ids': [['6e5f08dc-fe95-4b74-884d-dcce8470290a',
   '7b534168-d2e7-498e-9115-5e21d6c638f3',
   '9a4b69f8-749a-4b7c-a3e3-e2db4f3823d1',
   '7c83cf46-05fc-491d-9667-20acf68fe70f',
   '7956f84b-20a8-4836-ae7a-c7311d716cd1',
   '56ae1a37-74f4-486b-b517-34b99027ba36',
   '631d3937-3f47-4598-8f45-bdb90d5eb91f',
   'c7120f30-4152-4e88-bec1-698bfdd2d5e1',
   'd97d057d-2564-427d-9703-e77a61ff58c7',
   'd86a7f75-df06-47f1-a30d-67a921d822bf']],
 'distances': [[1.3786437511444092,
   1.4539598226547241,
   1.4855282306671143,
   1.501915693283081,
   1.5252399444580078,
   1.5319929122924805,
   1.552965760231018,
   1.5618354082107544,
   1.5635420083999634,
   1.5718779563903809]],
 'metadatas': [[None, None, None, None, None, None, None, None, None, None]],
 'embeddings': None,
 'documents': [["{'semantic_id': 'territorial_waters', 'category': 'geographic_area', 'attributes': {'name': 'territorial waters', 'definition': 'a belt of coastal waters extending at most 12 nautical miles (22.2 km; 13.8 mi) from the baseline (usually the mean low-water mark) of a coastal state', 'source': '1982 United Nations Convention on the Law of the Sea'}, 'paragraph_idx': 18}",
   "{'semantic_id': 'territorial_sea', 'category': 'geographic_area', 'attributes': {'name': 'territorial sea', 'definition': 'a belt of coastal waters extending at most 12 nautical miles (22.2 km; 13.8 mi) from the baseline (usually the mean low-water mark) of a coastal state', 'sovereign_territory': True, 'foreign_ship_passage': 'innocent passage through it or transit passage for straits', 'jurisdiction': 'extends to airspace over and seabed below'}, 'paragraph_idx': 18}",
   "{'semantic_id': 'hmwssb', 'category': 'administrative_body', 'attributes': {'name': 'Hyderabad Metropolitan Water Supply and Sewerage Board', 'type': 'water_management'}}",
   "{'semantic_id': 'arafura_swamp', 'category': 'natural_feature', 'attributes': {'name': 'Arafura Swamp', 'type': 'inland freshwater wetland', 'location': {'region': 'Arnhem Land', 'territory': 'Northern Territory', 'country': 'Australia'}, 'size': {'area': {'value': None, 'unit': 'km2'}, 'expansion_during_wet_season': True}, 'description': 'a near pristine floodplain, possibly the largest wooded swamp in the Northern Territory and Australia', 'cultural_significance': 'of great cultural significance to the Yolngu people, in particular the Ramingining community', 'filming_location': 'Ten Canoes'}, 'paragraph_idx': 2}",
   "{'semantic_id': 'san_juan', 'category': 'city', 'attributes': {'name': 'San Juan', 'location': {'country': 'Puerto Rico', 'region': 'north-eastern coast'}, 'borders': {'north': 'Atlantic Ocean', 'south': ['Caguas', 'Trujillo Alto'], 'east': ['Carolina'], 'west': ['Guaynabo']}, 'area': {'value': 76.93, 'unit': 'square miles'}, 'water_bodies': ['San Juan Bay', 'Condado Lagoon', 'San José Lagoon'], 'water_area': {'value': 29.11, 'unit': 'square miles', 'percentage': 37.83}}, 'paragraph_idx': 7}",
   "{'semantic_id': 'contoocook_river', 'category': 'river', 'attributes': {'name': 'Contoocook River', 'flow_direction': 'north', 'outflow_destination': 'merrimack_river'}, 'paragraph_idx': 15}",
   "{'semantic_id': 'ghmc_area', 'category': 'administrative_district', 'attributes': {'name': 'GHMC area', 'jurisdiction_size': 'second_largest', 'alternate_name': 'Hyderabad city'}}",
   "{'semantic_id': 'lake_oesa', 'category': 'natural_feature', 'attributes': {'name': 'Lake Oesa', 'elevation': 2267, 'elevation_unit': 'm', 'location': {'park': 'Yoho National Park', 'city': 'Field', 'province': 'British Columbia', 'country': 'Canada'}}, 'paragraph_idx': 1}",
   "{'semantic_id': 'intracellular_fluid', 'category': 'fluid', 'attributes': {'name': 'intracellular fluid', 'volume': '2/3 of body water', 'amount_in_72_kg_body': '25 litres', 'percentage_of_total_body_fluid': 62.5}, 'paragraph_idx': 17}",
   "{'semantic_id': 'strait', 'category': 'geographic_feature', 'attributes': {'name': 'strait', 'sovereign_territory': True, 'jurisdiction': {'airspace': True, 'seabed': True}}, 'paragraph_idx': 18}"]],
 'uris': None,
 'data': None}

We’re interested in the top 3 results. Why? Why not?

import uuid

uuid_strs = top_results['ids'][0][:3]
top_uuids = [uuid.UUID(uuid_str, version=4) for uuid_str in uuid_strs]

We get the top UUIDs to use in conjunction with the UUID-to-graph-index mapping we constructed during the creation of the knowledge graph.
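As a refresher, node_indices is that mapping; a quick sketch of its shape and how it gets used here:

# node_indices: dict of {node UUID: integer index in the directed graph},
# built while populating the graph. Looking up a retrieved UUID gives us
# the graph node to explore.
top_indices = [node_indices[top_uuid] for top_uuid in top_uuids]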

Let’s take a gander at the nodes connecting to our top 3 results. Our network graph is directed, meaning that direction matters (and creates a much easier semantic designation for the edge). A predecessor node is a node from which the linkage stems, and a successor node is the node towards which the linkage is directed. Taking their union, we have an exhaustive list of the nodes connecting to the ones we retrieved from our vector store.

for top_uuid in top_uuids:
    for idx in list(digraph.successor_indices(node_indices[top_uuid])) + list(digraph.predecessor_indices(node_indices[top_uuid])):
        print(digraph[idx])
{'semantic_id': 'coastal_state', 'category': 'legal_entity', 'attributes': {'name': 'coastal state'}}
{'semantic_id': 'baseline', 'category': 'geographic_feature', 'attributes': {'name': 'baseline', 'definition': 'usually the mean low-water mark of a coastal state'}}
{'semantic_id': 'baseline', 'category': 'geographic_feature', 'attributes': {'name': 'baseline', 'definition': 'usually the mean low-water mark of a coastal state'}}
{'semantic_id': 'coastal_state', 'category': 'legal_entity', 'attributes': {'name': 'coastal state'}}
{'semantic_id': 'hmda_area', 'category': 'administrative_district', 'attributes': {'name': 'Hyderabad Metropolitan Development Authority (HMDA) area', 'jurisdiction_size': 'largest', 'type': 'urban_planning_agency', 'apolitical': True, 'covers': ['ghmc_area', 'suburbs_of_ghmc_area']}, 'paragraph_idx': 6}

4.1 Transforming the Evidence

Now that we have the nodes, we need to grab their edges, and then transform both into a format that is easily digestible for the LLM. We will still feed the nodes and edges in as JSON strings, but we’ll need to augment them to replace UUIDs with something less complex, like a monotonically increasing integer. This way, the LLM can use integers like 0 and 1 instead of 5f092031-cf0d-408c-a4f1-896e7c8607be and bc1c5af9-c311-4e9f-975d-349d33d41a15 when interpreting the from_node and to_node fields of the edges.

import copy

node_hist_dict = {} # idx: obj mapping
edge_hist_dict = {} # (from_idx, to_idx): obj mapping
uuid_list = [] # used for dup checking
id_counter = 0 # used for creating new easily-parseable ids
for top_uuid in top_uuids:
    top_idx = node_indices[top_uuid]
    uuid_list.append(top_uuid)
    successor_idxs = [('s', successor) for successor in digraph.successor_indices(top_idx)]
    predecessor_idxs = [('p', predecessor) for predecessor in digraph.predecessor_indices(top_idx)]
    neighbor_idxs = successor_idxs + predecessor_idxs
    if top_idx not in node_hist_dict: # Add the top node if it's not already in the node_hist_dict
        main_node = copy.deepcopy(digraph[top_idx])
        main_node['connecting_id'] = id_counter
        node_hist_dict[top_idx] = main_node
        id_counter += 1
    else:
        main_node = node_hist_dict[top_idx]
    if (len(neighbor_idxs) > 0):
        for connection_type, idx in neighbor_idxs: 
            if idx in node_hist_dict:
                secondary_connecting_id = node_hist_dict[idx]['connecting_id']
            else:
                secondary_connecting_id = id_counter
                node_hist_dict[idx] = copy.deepcopy(digraph[idx])
                node_hist_dict[idx]['connecting_id'] = secondary_connecting_id
                id_counter += 1
            # If the connections are already in the edge_hist_dict, skip
            if ((connection_type == 's' and (top_idx, idx) in edge_hist_dict) or 
                (connection_type == 'p' and (idx, top_idx) in edge_hist_dict)):
                continue
            elif connection_type == 's':
                edge_hist_dict[(top_idx, idx)] = copy.deepcopy(digraph.get_edge_data(top_idx, idx))
                edge_hist_dict[(top_idx, idx)]['from_node'] = main_node['connecting_id']
                edge_hist_dict[(top_idx, idx)]['to_node'] = secondary_connecting_id
            elif connection_type == 'p':
                edge_hist_dict[(idx, top_idx)] = copy.deepcopy(digraph.get_edge_data(idx, top_idx))
                edge_hist_dict[(idx, top_idx)]['from_node'] = secondary_connecting_id
                edge_hist_dict[(idx, top_idx)]['to_node'] = main_node['connecting_id']
The astute reader may have noticed that we could’ve just used the integer indices of the nodes in the network graph. Intuitively, it makes more sense to me to keep the integers small by creating a new counter for each presentation of evidence. In practice, this may not be the case.
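For completeness, a sketch of that alternative, letting the graph’s own node indices double as connecting ids (illustrative only):

# Alternative: reuse the digraph's integer node index as the connecting_id,
# skipping the per-evidence counter entirely.
alt_nodes = {}
for top_uuid in top_uuids:
    top_idx = node_indices[top_uuid]
    node = copy.deepcopy(digraph[top_idx])
    node['connecting_id'] = top_idx  # the graph index doubles as the id
    alt_nodes[top_idx] = node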

Let’s take a look at what the evidence will look like:

from pprint import pprint

pprint(list({**node_hist_dict, **edge_hist_dict}.values()))
[{'attributes': {'definition': 'a belt of coastal waters extending at most 12 '
                               'nautical miles (22.2 km; 13.8 mi) from the '
                               'baseline (usually the mean low-water mark) of '
                               'a coastal state',
                 'name': 'territorial waters',
                 'source': '1982 United Nations Convention on the Law of the '
                           'Sea'},
  'category': 'geographic_area',
  'connecting_id': 0,
  'paragraph_idx': 18,
  'semantic_id': 'territorial_waters'},
 {'attributes': {'name': 'coastal state'},
  'category': 'legal_entity',
  'connecting_id': 1,
  'semantic_id': 'coastal_state'},
 {'attributes': {'definition': 'usually the mean low-water mark of a coastal '
                               'state',
                 'name': 'baseline'},
  'category': 'geographic_feature',
  'connecting_id': 2,
  'semantic_id': 'baseline'},
 {'attributes': {'definition': 'a belt of coastal waters extending at most 12 '
                               'nautical miles (22.2 km; 13.8 mi) from the '
                               'baseline (usually the mean low-water mark) of '
                               'a coastal state',
                 'foreign_ship_passage': 'innocent passage through it or '
                                         'transit passage for straits',
                 'jurisdiction': 'extends to airspace over and seabed below',
                 'name': 'territorial sea',
                 'sovereign_territory': True},
  'category': 'geographic_area',
  'connecting_id': 3,
  'paragraph_idx': 18,
  'semantic_id': 'territorial_sea'},
 {'attributes': {'name': 'Hyderabad Metropolitan Water Supply and Sewerage '
                         'Board',
                 'type': 'water_management'},
  'category': 'administrative_body',
  'connecting_id': 4,
  'semantic_id': 'hmwssb'},
 {'attributes': {'apolitical': True,
                 'covers': ['ghmc_area', 'suburbs_of_ghmc_area'],
                 'jurisdiction_size': 'largest',
                 'name': 'Hyderabad Metropolitan Development Authority (HMDA) '
                         'area',
                 'type': 'urban_planning_agency'},
  'category': 'administrative_district',
  'connecting_id': 5,
  'paragraph_idx': 6,
  'semantic_id': 'hmda_area'},
 {'category': 'belongs_to', 'from_node': 0, 'to_node': 1},
 {'category': 'extends_from', 'from_node': 0, 'to_node': 2},
 {'category': 'extends_from', 'from_node': 3, 'to_node': 2},
 {'category': 'belongs_to', 'from_node': 3, 'to_node': 1},
 {'category': 'manages', 'from_node': 5, 'to_node': 4}]

You can now see the nodes have connecting_ids 0–5, which are then used in the from_node and to_node fields of the edges.

5 Putting It All Together

We’ve got the evidence; now we need to finalize our pipeline and check the response from the LLM.

First, let’s combine our System and Human templates.

combined_template = system_message + human_template
combined_template
ChatPromptTemplate(input_variables=['evidence', 'question'], messages=[SystemMessage(content='You are the best taker of tests, particularly excelling at answering questions based on information provided to you. You will be given nodes and edges from a knowledge graph in a JSON format and you are expected to answer a question based on them. The \'from_node\' and \'to_node\' fields in the edges correspond to the \'connecting_id\' fields in the nodes. Your output will only be JSON, and nothing more. No yapping.\nThis JSON Schema is the format you will be using: {"properties": {"answerable": {"description": "true or false value. Whether or not the answer is answerable based on the provided nodes and edges", "title": "Answerable", "type": "boolean"}, "answer": {"description": "The answer to the question. Terse and concise.", "title": "Answer", "type": "string"}, "support_idxs": {"description": "The indices of the nodes that support the answer. From \'paragraph_idx\' field", "items": {"type": "integer"}, "title": "Support Idxs", "type": "array"}}, "required": ["answerable", "answer", "support_idxs"], "title": "Answer", "type": "object"}'), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['evidence', 'question'], template='Question: {question}\n\n Supporting Evidence:\n {evidence}'))])

Next, we need to wrap our Pydantic Answer class in a PydanticOutputParser, which takes the output from the LLM and parses the string as JSON, erroring out if the structure does not match our schema. We then wrap that parser in an OutputFixingParser, which will attempt to fix any errors that occur during parsing by passing the error and the output back to the LLM for remediation.

from langchain.output_parsers import PydanticOutputParser, OutputFixingParser

_output_parser = PydanticOutputParser(pydantic_object=Answer)
output_parser = OutputFixingParser.from_llm(parser=_output_parser, llm=chat_model, max_retries=3)
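As a quick sanity check with made-up values, well-formed JSON goes straight through the inner parser; the LLM round trip only kicks in when parsing fails:

good = '{"answerable": true, "answer": "Kama River", "support_idxs": [10, 11]}'
print(_output_parser.parse(good))
Answer(answerable=True, answer='Kama River', support_idxs=[10, 11])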

5.1 Moment of truth

We can now instantiate the pipeline, pass in the question and evidence, then run it.

answer_pipe = combined_template | chat_model | output_parser

evidence = str(list({**node_hist_dict, **edge_hist_dict}.values()))
question = lines[-2]['question']
ans = answer_pipe.invoke({'question': question, 'evidence': evidence})
question, ans, lines[-2]['answer']
('What is the body of water by the city where Zvezda stadium is located?',
 Answer(answerable=True, answer='The Hussain Sagar lake', support_idxs=[5, 6]),
 'Kama River')

It’s okay to feel all sorts of things when you don’t get your expected result.
  1. What

The supporting evidence makes no mention of Hussain Sagar lake; however, I did look it up and found that it is located in Hyderabad, which is in our evidence. Here, we see the model reaching into its own trained knowledge base and ignoring the evidence.

5.2 Coping (Deal with it)

I’m not going to lie to you; I was not expecting the correct answer on the first trial run. Once we have an end-to-end pipeline established, it needs to be tuned to the task at hand. In order to tune a model, the best practice is to understand where its deficiencies lie.

According to the dataset, the supporting evidence comes from paragraphs 10 and 11, neither of which made it into our evidence.

10:

   Perm (;) is a city and the administrative centre of Perm Krai, Russia, located on
   the banks of the Kama River in the European part of Russia near the Ural Mountains.

11:

   Star (Zvezda) Stadium (), until 1991 Lenin Komsomol Stadium (), is a multi-use stadium in
   Perm, Russia. It is currently used mostly for football matches and is the home ground of FC
   Amkar Perm. The stadium holds 17,000 people and was opened on June 5, 1969.

Not only was the answer wrong, but the LLM believed that the question was answerable given the available evidence.

5.2.1 Is a more potent model the answer?

Given that the evidence does not support the correct answer, we can say that our model is hallucinating: it marked the question as answerable and gave us a wrong answer. Weak models tend to hallucinate often. Could that be the case here? Let’s use a more powerful model to test the hypothesis.

from langchain_anthropic import ChatAnthropic

chat_model_adv = ChatAnthropic(model_name="claude-3-opus-20240229")
answer_pipe = combined_template | chat_model_adv | output_parser

evidence = str(list({**node_hist_dict, **edge_hist_dict}.values()))
question = lines[-2]['question']
ans_adv = answer_pipe.invoke({'question': question, 'evidence': evidence})

This is a more sensible answer, using the evidence correctly: no good evidence, not answerable. As things should be.

It will be useful to outline the problems and their corresponding potential solutions to explore.

Problem 1: The LLM believes the question is answerable when it is not, and hallucinates information.

Solution 1a: Use a more powerful model. This isn’t a hard ask, given the exponential improvement and falling costs.

Solution 1b: Add a chain-of-thought step, e.g. a field in the output schema that produces the reasoning behind the answer (sketched below). This may be an acceptable solution when combined with a weak model.
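Here's what Solution 1b might look like, using a hypothetical AnswerWithReasoning schema; the reasoning field is declared first so the model writes out its justification before committing to an answer:

class AnswerWithReasoning(BaseModel):
    # Declared first so the model generates its reasoning before the answer.
    reasoning: str = Field(..., description="Step-by-step reasoning over the provided nodes and edges, written before answering.")
    answerable: bool = Field(..., description="Whether the question is answerable based on the provided nodes and edges")
    answer: str = Field(..., description="The answer to the question. Terse and concise.")
    support_idxs: List[int] = Field(..., description="The indices of the nodes that support the answer. From 'paragraph_idx' field")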

5.2.2 Graph Woes

The evidence returned by our pipeline is inadequate. Looking through the knowledge graph carefully, we can see that the nodes and edges present in it contain information that IS capable of answering the question.

The following nodes are present within the knowledge graph:

{'semantic_id': 'star_stadium',
 'category': 'stadium',
 'attributes': {'name': 'Star (Zvezda) Stadium',
  'former_name': 'Lenin Komsomol Stadium',
  'location': {'city': 'Perm', 'country': 'Russia'},
  'usage': 'football matches',
  'home_team': 'FC Amkar Perm',
  'capacity': 17000,
  'opened': '1969-06-05'}}
{'semantic_id': 'perm',
 'category': 'city',
 'attributes': {'name': 'Perm',
  'location': {'river': 'Kama River',
   'region': 'Perm Krai',
   'country': 'Russia',
   'geography': 'European part of Russia near the Ural Mountains'},
  'administrative_status': 'administrative centre'}}

As is the connection between them:

{'from_node': UUID('08f207c1-6915-4237-ac4e-902815d9cfae'),
 'to_node': UUID('5be79bf7-cd2a-487f-8833-36ae11257df8'),
 'category': 'located_in'}

We’ve found a limitation of the semantic search: the semantics/vibes of the question matched the incorrect evidence. We can’t say exactly why, but it could simply be the amount of water-adjacent terminology found in the question and the retrieved passages. Or it could be something else entirely. This sort of latent space analysis is tough, and sometimes impossible to do well.

Problem 2: Searching our vector store for the correct nodes is inadequate. We need to do more than merely capture the gist of a passage in its embedding.

Solution 2a: Use a HyDE-style approach, where an LLM generates hypothetical questions to accompany each node’s information dump, so that the vector search is more likely to match the embedding of the question to the node.

Solution 2b: Hybrid search. By combining a sparse search (word-matching) and a dense search (embedding-based), we can better capture the exact terminology of the question. “Zvezda stadium” would be a far more likely match in that case. And because ‘zvezda’ is the Russian word for ‘star’, the dense/semantic search would still be likely to surface the node if someone asked about “Star Stadium”, even though ‘Zvezda’ isn’t part of the question. A sketch of the HyDE idea follows below.
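To make Solution 2a concrete, here's a sketch; the prompt and helper name are my own invention, and chat_model is the model from earlier:

from langchain_core.messages import HumanMessage

def hypothetical_questions(node_json, n=3):
    # Ask the LLM for n questions this node could answer; indexing these
    # alongside the node dump gives the user's question a question to match.
    prompt = (f"Write {n} short questions, one per line, that the following "
              f"knowledge graph node could help answer:\n{node_json}")
    response = chat_model.invoke([HumanMessage(prompt)])
    return [q.strip() for q in response.content.splitlines() if q.strip()]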

6 Steps Moving Forward

Closing up, let’s talk about the road ahead:

  1. Composability: We have a rough outline of the steps, which I reluctantly call a pipeline, needed to go from paragraphs and a question to an evidenced answer. Moving forward, this should be a simple workflow that we can loop over any number of given items.
  2. Error handling: Some more error handling would be nice. Some of the code that generates the knowledge graph needs to be reworked to error out and retry generating the nodes and edges of a particular text chunk when it hallucinates connections. I’ve seen it make up semantic_ids when creating edges, which reduces the quality of our graph because we can’t use those edges, and they’ve likely taken the place of useful ones. This would function similarly to the OutputFixingParser we used to wrap our PydanticOutputParser, allowing it to self-correct (see the sketch after this list).
  3. Graph Connection Generation: After our knowledge graph is generated, we can allow a few random(or not so random) passes of the LLM as discussed in part one to potentially create connections between disparate nodes. This step is probably not necessary given the relatively small size of the paragraphs, but it would be immensely useful for entity resolution and graph refinement for a larger corpus of text.
  4. Chunk augmentation and prompt size expansion: It may also be a good idea to tune the chunk size we’re using, to speed things up and improve performance, as well as to keep track of more nodes in the history we pass in. I’m hesitant to do this because I want the approach to be as versatile as possible, and easily transferable to running solely on local machines.
  5. (Teaser) DSPy Prompt Tuning: As a further goal, being able to optimize the prompts we’re using would be a great benefit, removing the arduous task of manually trying to condition a clever prompt. A further benefit is that modularizing the workflow this way also lets us generate examples and tune smaller models into being comparably capable. This requires more effort and time, and is something I’d like to get to eventually.
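And, as promised, a sketch of the validate-and-retry wrapper from point 2; generate_chunk_graph is a hypothetical stand-in for the node-and-edge-generation chain from the earlier posts:

def generate_with_validation(chunk, max_retries=3):
    # Regenerate a chunk's subgraph until no edge references a made-up
    # semantic_id, mirroring what OutputFixingParser does for JSON errors.
    for _ in range(max_retries):
        nodes, edges = generate_chunk_graph(chunk)  # hypothetical wrapper
        known_ids = {node['semantic_id'] for node in nodes}
        if all(edge['from_node'] in known_ids and edge['to_node'] in known_ids
               for edge in edges):
            return nodes, edges
    raise ValueError(f"Edges still reference unknown nodes after {max_retries} attempts")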