Don’t RAG on Knowledge Graphs (Or Do) Benchmarking: Finally Building a Knowledge Graph – Part Two

Building a knowledge graph in Python with Claude 3 Haiku (works for ≥ GPT 3.5 as well)
knowledge-graphs
rag
benchmarking
Author

Dmitriy Leybel

Published

April 20, 2024

Abstract
This post introduces you to building a knowledge graph in Python using an LLM. This involves orchestrating the working components of LangChain in order to call the LLM, compose the prompts, and create our pipeline with its expression language. We then visualize the graph with rustworkx.

On the last episode of: Don’t RAG on Knowledge Graphs (Or Do) Benchmarking: Theory behind using an LLM to Build Knowledge Graphs – Part One:


Finally, we’re getting to the fun part. Like many, I thought this day would never come, but here we are.

I’m going to introduce the numerous components we’ll be using, and then combine them into our knowledge graph creation pipeline.

1 Let’s Split Some Text

In order to feed text of reasonable length into our LLM, we need to be able to split it. The splitting criterion will be the token length of the passage. To implement this criterion, we need to create a length function that will be passed into our splitter, and then test it on one of the paragraphs we have available from the MuSiQue dataset.

Code
import tiktoken

def token_len(text: str, model: str = "gpt-4") -> int:
    encoder = tiktoken.encoding_for_model(model)
    return len(encoder.encode(text))

pprint(paragraphs[0]['paragraph_text'])
print('Token length: ', token_len(paragraphs[0]['paragraph_text']))
('The Commonwealth of the Philippines (; ) was the administrative body that '
 'governed the Philippines from 1935 to 1946, aside from a period of exile in '
 'the Second World War from 1942 to 1945 when Japan occupied the country. It '
 'replaced the Insular Government, a United States territorial government, and '
 'was established by the Tydings–McDuffie Act. The Commonwealth was designed '
 "as a transitional administration in preparation for the country's full "
 'achievement of independence.')
Token length:  95

As noted in the last post, we’re going to do a little assuming about the Claude 3 Haiku tokenization and say that it’s comparable to the latest OpenAI models – which is why we’re going to get away with using OpenAI’s tokenizer, tiktoken.

As of this writing, Meta’s Llama 3 was just released and is using OpenAI’s tiktoken (and it’s incredible)
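
If you want to see which encoding the tiktoken proxy actually resolves to, it will tell you; gpt-4 and gpt-3.5-turbo both map to cl100k_base, which is the assumption behind using it as a stand-in for Haiku:

import tiktoken

# Sanity check on the proxy assumption: both recent OpenAI chat models use cl100k_base
print(tiktoken.encoding_for_model("gpt-4").name)          # cl100k_base
print(tiktoken.encoding_for_model("gpt-3.5-turbo").name)  # cl100k_base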

We’ll be using LangChain’s RecursiveCharacterTextSplitter to split the text into chunks. It algorithmically uses punctuation to help split the text and preserve some sentence structure, so the chunks will sometimes be smaller than our specified chunk size. For illustrative purposes, the following example uses a chunk size and a chunk overlap different from what we’ll end up using in the pipeline. Two of the paragraphs are split below with a specified chunk size of 20 and an overlap of 5. If you peek into the code, you can see that we’re using our length function as the determinant of splits.

Code
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5, length_function=token_len)
splits0 = text_splitter.split_text(paragraphs[0]['paragraph_text'])
splits0_tups = [('Token length: ' + str(token_len(s)), s) for s in splits0]
splits1 = text_splitter.split_text(paragraphs[1]['paragraph_text'])
splits1_tups = [('Token length: ' + str(token_len(s)), s) for s in splits1]

display(Markdown('**Paragraph 1**'))
pprint(splits0_tups)
display(Markdown('**Paragraph 2**'))
display(splits1_tups)

Paragraph 1

[('Token length: 20',
  'The Commonwealth of the Philippines (; ) was the administrative body that '
  'governed the Philippines from 1935 to'),
 ('Token length: 20',
  'from 1935 to 1946, aside from a period of exile in the Second World War'),
 ('Token length: 20',
  'in the Second World War from 1942 to 1945 when Japan occupied the country. '
  'It'),
 ('Token length: 20',
  'occupied the country. It replaced the Insular Government, a United States '
  'territorial government, and was established'),
 ('Token length: 20',
  'government, and was established by the Tydings–McDuffie Act. The '
  'Commonwealth was designed'),
 ('Token length: 19',
  'The Commonwealth was designed as a transitional administration in '
  "preparation for the country's full achievement of independence.")]

Paragraph 2

[('Token length: 18',
  'Lake Oesa is a body of water located at an elevation of 2,267m'),
 ('Token length: 19',
  '2,267m (7438 ft) in the mountains of Yoho National Park, near'),
 ('Token length: 11', 'National Park, near Field, British Columbia, Canada.')]

2 Prompting

Prompting our model is as simple as loading the API key as an environment variable, then instantiating the model with LangChain. We can pass any text string we want to the model as long as it observes the token limits.

from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic

load_dotenv()

chat_model = ChatAnthropic(model_name='claude-3-haiku-20240307')
joke = chat_model.invoke("Tell me a mid joke about airplanes and horses")
display(joke)
joke.pretty_print()
AIMessage(content="Here's a mildly silly joke about airplanes and horses:\n\nWhy did the horse refuse to get on the airplane? Because it already had a stable flight plan!", response_metadata={'id': 'msg_01K9jCUru7b4TiBBC6eaWRxf', 'model': 'claude-3-haiku-20240307', 'stop_reason': 'end_turn', 'stop_sequence': None, 'usage': {'input_tokens': 18, 'output_tokens': 39}}, id='run-2cb963b0-3180-432e-95a7-368169c5bef0-0')
================================== Ai Message ==================================

Here's a mildly silly joke about airplanes and horses:

Why did the horse refuse to get on the airplane? Because it already had a stable flight plan!

dotenv allows us to load environment variables from a .env file
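
As a minimal sketch (assuming the key is stored under ANTHROPIC_API_KEY, the variable the Anthropic client looks for), you can verify that the .env file was actually picked up:

import os
from dotenv import load_dotenv

load_dotenv()
# If this fails, add a line like ANTHROPIC_API_KEY=... to your .env file
assert os.getenv("ANTHROPIC_API_KEY") is not None, "ANTHROPIC_API_KEY not found"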

It’s so over for stand up comedians.

While we can easily pass strings into the LLM call, LangChain provides us with templates, which enable endless composability and modularity, as will be witnessed as we create our fairly elaborate prompts – but first, an illustration of the structure we’ll be using.

Figure 1: Prompt Template Composition

As witnessed above, we’re creating a template out of multiple templates. A System Message is a message sent to an LLM that tells it how to respond in the style, tone, or format of your choosing; it primes it with an ‘identity’. The Human Message is the message you send to the LLM after you prime it with the system message. Do you actually need to differentiate between them? Meh. In my experience it makes no difference, and I haven’t seen any testing to suggest otherwise, but in case future models start to take the distinction more seriously, we should keep using it. LLMs that function as chat models tend to accept a series of messages through their APIs, which LangChain helps us facilitate.

Let’s decompose the components of gen_template, the main template we’ll be using in our pipeline.

The difference between a prompt and a template is that a template can contain {placeholder variables} that get filled in by our pipeline, as you will see.
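
As a toy illustration (not part of the pipeline), here’s a throwaway template with two placeholder variables that only get resolved at invocation time:

from langchain_core.prompts import ChatPromptTemplate

toy_template = ChatPromptTemplate.from_messages([
    ("system", "You are a {persona}."),
    ("human", "Summarize the following passage:\n{passage}"),
])
# The placeholders are filled in when we invoke the template
toy_prompt = toy_template.invoke({"persona": "terse analyst",
                                  "passage": "Fido ran over the bridge."})
print(toy_prompt.to_string())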

2.1 graph_analyst_template

This is the main system prompt template. It’s going to inform the LLM of its purpose, the format we expect it to return to us, the format of what we send to it, and any history we want it to take into account when generating its response.

2.1.1 Instructions (Pydantic and JSON Schema Magic)

To programmatically build a knowledge graph, the output of the LLM will have to be very specific and in a format we can easily process. Foundation models like Claude 3 excel at processing code and various formatted specifications. The specification that’s of interest to us is JSON Schema, which is designed to describe the structure of JSON data. Here are some examples of this specification. It describes the fields, their types, and any particular data structures you need in your JSON.

I trust you’ve perused the examples and are not too stoked to write all of that out yourself. Well, you won’t have to, because we can express the same thing in a cleaner, Pythonic format using the Pydantic library – it makes structured outputs a breeze. In fact, there are entire libraries, like Instructor, that are centered on using Pydantic to generate structured output from LLMs and that help you validate the output against the schema specification.

The nodes and edges we need to construct for the knowledge graph aren’t overly complex, but they do have their nuances and enough moving parts to warrant a systematic approach to their production.

Figure 2: The node-edge structure we construct from the outputs.

Each individual node has an identifier, a category, a variable number of attributes, the source text it was created from, and an identifier of the paragraph it was created from, taken from the dataset itself. The LLM won’t have to generate all of these properties, as the paragraph ID is simply taken from the paragraph that creates it; in fact, it can probably be a list of IDs where that particular node is referenced. The edges are a degree simpler, as they just need a category, some attributes, and the nodes which they connect.

Pydantic, along with a similar sort of workflow, can be generalized for structured extraction of any sort with LLMs. You define the JSON structure, feed the LLM a passage, and it extracts the fields you specified. This is a complete game-changer for machine learning and feature generation (much more exciting than chatbots, IMO).
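
To drive home how general this is, here’s a hypothetical extraction schema that has nothing to do with graphs – the workflow (define the fields, feed a passage, parse the JSON) stays identical:

from typing import List, Optional
from pydantic import BaseModel, Field

# A hypothetical schema for a different extraction task; the field names are illustrative
class CompanyMention(BaseModel):
    name: str = Field(..., description="A company named in the passage")
    ticker: Optional[str] = Field(None, description="Stock ticker, if mentioned")
    sentiment: str = Field(..., description="positive, negative, or neutral")

class CompanyMentions(BaseModel):
    mentions: List[CompanyMention] = Field(..., description="Every company mentioned in the passage")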

Below, you’ll see each class represent a distinct JSON object, with the fields and instructions that the model will receive. By using the BaseModel superclass😎, we can create Pydantic classes with the following syntax:

from pydantic import BaseModel, Field
from typing import Dict, List, Union, Tuple, Optional
import json

class Node(BaseModel):
    semantic_id: str = Field(..., description="The unique identifier of the node that is \
                             a reference to create edges between different nodes.")
    category: str = Field(..., description="The category of the node")
    attributes: Optional[Dict[str, Union[str, int, bool]]] = Field(None, description="Additional properties of the node")

class Edge(BaseModel):
    from_node: str = Field(..., description="The id of the node from which the edge originates. Only previously generated semantic_ids belong here, nothing else.")
    to_node: str = Field(..., description="The id of the node to which the edge connects. Only previously generated semantic_ids belong here, nothing else.")
    category: str = Field(..., description="The type of the relationship")
    attributes: Optional[Dict[str, Union[str, int, bool]]] = Field(None, description="Additional properties of the edge")

class Graph(BaseModel):
    nodes: List[Node] = Field(...,description="A list of nodes in the graph")
    edges: List[Edge] = Field(...,description="A list of edges in the graph")

Graph.model_json_schema()
{'$defs': {'Edge': {'properties': {'from_node': {'description': 'The id of the node from which the edge originates. Only previously generated semantic_ids belong here, nothing else.',
     'title': 'From Node',
     'type': 'string'},
    'to_node': {'description': 'The id of the node to which the edge connects. Only previously generated semantic_ids belong here, nothing else.',
     'title': 'To Node',
     'type': 'string'},
    'category': {'description': 'The type of the relationship',
     'title': 'Category',
     'type': 'string'},
    'attributes': {'anyOf': [{'additionalProperties': {'anyOf': [{'type': 'string'},
         {'type': 'integer'},
         {'type': 'boolean'}]},
       'type': 'object'},
      {'type': 'null'}],
     'default': None,
     'description': 'Additional properties of the edge',
     'title': 'Attributes'}},
   'required': ['from_node', 'to_node', 'category'],
   'title': 'Edge',
   'type': 'object'},
  'Node': {'properties': {'semantic_id': {'description': 'The unique identifier of the node that is                              a reference to create edges between different nodes.',
     'title': 'Semantic Id',
     'type': 'string'},
    'category': {'description': 'The category of the node',
     'title': 'Category',
     'type': 'string'},
    'attributes': {'anyOf': [{'additionalProperties': {'anyOf': [{'type': 'string'},
         {'type': 'integer'},
         {'type': 'boolean'}]},
       'type': 'object'},
      {'type': 'null'}],
     'default': None,
     'description': 'Additional properties of the node',
     'title': 'Attributes'}},
   'required': ['semantic_id', 'category'],
   'title': 'Node',
   'type': 'object'}},
 'properties': {'nodes': {'description': 'A list of nodes in the graph',
   'items': {'$ref': '#/$defs/Node'},
   'title': 'Nodes',
   'type': 'array'},
  'edges': {'description': 'A list of edges in the graph',
   'items': {'$ref': '#/$defs/Edge'},
   'title': 'Edges',
   'type': 'array'}},
 'required': ['nodes', 'edges'],
 'title': 'Graph',
 'type': 'object'}

The Graph class is the ultimate class we’re using to generate the JSON schema. It combines the Node and Edge classes into lists, as we want the final output to be a collection of nodes and the edges that connect them. model_json_schema() outputs the JSON schema of the format we want the LLM to return.

It may be worthwhile to read through the fields and their descriptions carefully, and mind the semantic_id in the Node class; its purpose is to allow the LLM to use that identifier in the from_node and to_node fields of the edges.
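
As a quick sanity check (a hand-written toy instance, not model output), the classes compose like so, with the edge referencing the nodes by their semantic_ids:

toy_graph = Graph(
    nodes=[
        Node(semantic_id="fido", category="dog"),
        Node(semantic_id="bridge", category="structure"),
    ],
    edges=[
        # from_node and to_node point at the semantic_ids defined above
        Edge(from_node="fido", to_node="bridge", category="ran_over"),
    ],
)
print(toy_graph.model_dump_json(exclude_none=True))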

You can probably use Pydantic classes to describe the JSON output we need without even generating the JSON schema. Such is the magic of LLMs.

In addition to our fancy JSON schema generated with Pydantic, which already includes some descriptions of the fields, we need to pass in some instructions.

json_rules = \
"""We need to create a JSON object that contains a list of nodes and edges that connect the nodes.
Both, nodes and edges, have optional attributes.
Your goal is to extract as much pertinent information from the passage as possible and create nodes and edges with the extracted information.
If history is provided, it will be in the JSON schema you are given. You may create new connections between the nodes and edges in the history and the new nodes you are producing.
If you wish to change/update any of the node attributes in the provided history based on newly gathered information, simply reuse the semantic_ids of the nodes you wish to change.
If you wish to modify/update the edge attributes in the history, reuse the semantic_ids of the 'from' and 'to' nodes of any edge you wish to change.
Use the following schema and make sure to read the descriptions:
""" 

json_prompt_instructions = \
    json_rules + \
    json.dumps(Graph.model_json_schema()) + \
    "\n-----\n"

pprint(json_prompt_instructions)
('We need to create a JSON object that contains a list of nodes and edges that '
 'connect the nodes.\n'
 'Both, nodes and edges, have optional attributes.\n'
 'Your goal is to extract as much pertinent information from the passage as '
 'possible and create nodes and edges with the extracted information.\n'
 'If history is provided, it will be in the JSON schema you are given. You may '
 'create new connections between the nodes and edges in the history and the '
 'new nodes you are producing.\n'
 'If you wish to change/update any of the node attributes in the provided '
 'history based on newly gathered information, simply reuse the semantic_ids '
 'of the nodes you wish to change.\n'
 'If you wish to modify/update the edge attributes in the history, reuse the '
 "semantic_ids of the 'from' and 'to' nodes of any edge you wish to change.\n"
 'Use the following schema and make sure to read the descriptions:\n'
 '{"$defs": {"Edge": {"properties": {"from_node": {"description": "The id of '
 'the node from which the edge originates. Only previously generated '
 'semantic_ids belong here, nothing else.", "title": "From Node", "type": '
 '"string"}, "to_node": {"description": "The id of the node to which the edge '
 'connects. Only previously generated semantic_ids belong here, nothing '
 'else.", "title": "To Node", "type": "string"}, "category": {"description": '
 '"The type of the relationship", "title": "Category", "type": "string"}, '
 '"attributes": {"anyOf": [{"additionalProperties": {"anyOf": [{"type": '
 '"string"}, {"type": "integer"}, {"type": "boolean"}]}, "type": "object"}, '
 '{"type": "null"}], "default": null, "description": "Additional properties of '
 'the edge", "title": "Attributes"}}, "required": ["from_node", "to_node", '
 '"category"], "title": "Edge", "type": "object"}, "Node": {"properties": '
 '{"semantic_id": {"description": "The unique identifier of the node that '
 'is                              a reference to create edges between '
 'different nodes.", "title": "Semantic Id", "type": "string"}, "category": '
 '{"description": "The category of the node", "title": "Category", "type": '
 '"string"}, "attributes": {"anyOf": [{"additionalProperties": {"anyOf": '
 '[{"type": "string"}, {"type": "integer"}, {"type": "boolean"}]}, "type": '
 '"object"}, {"type": "null"}], "default": null, "description": "Additional '
 'properties of the node", "title": "Attributes"}}, "required": '
 '["semantic_id", "category"], "title": "Node", "type": "object"}}, '
 '"properties": {"nodes": {"description": "A list of nodes in the graph", '
 '"items": {"$ref": "#/$defs/Node"}, "title": "Nodes", "type": "array"}, '
 '"edges": {"description": "A list of edges in the graph", "items": {"$ref": '
 '"#/$defs/Edge"}, "title": "Edges", "type": "array"}}, "required": ["nodes", '
 '"edges"], "title": "Graph", "type": "object"}\n'
 '-----\n')

This prompt states that if a history of nodes and edges is provided, then the LLM is at liberty to reuse those semantic ids in order to modify their respective nodes and edges. Doing this allows for the knowledge graph to grow more dynamically as it processes more information.

For example, suppose we have two separate chunks of text that the LLM is exposed to at different times (with some adjacency between the processing of the passages, since we won’t keep the entire history of nodes and edges in the context window):

Fido ran over the bridge

and

Fido was hungry and stole a donut.

The semantic_id that identifies Fido would persist, so that the particular entity wouldn’t be duplicated.

Figure 3: The semantic id allows for continuity of the entity ‘Fido’
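
Expressed as Python dicts, the two chunks might yield something like the following (a hypothetical illustration, not actual model output); note that the second output reuses the semantic_id "fido" to update the existing node instead of creating a duplicate:

chunk_1_output = {
    "nodes": [{"semantic_id": "fido", "category": "dog"},
              {"semantic_id": "bridge", "category": "structure"}],
    "edges": [{"from_node": "fido", "to_node": "bridge", "category": "ran_over"}],
}
chunk_2_output = {
    "nodes": [{"semantic_id": "fido", "category": "dog",
               "attributes": {"hungry": True}},   # same id -> updates the existing node
              {"semantic_id": "donut", "category": "food"}],
    "edges": [{"from_node": "fido", "to_node": "donut", "category": "stole"}],
}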

2.1.2 Content

In addition to the JSON formatting instructions, we give the model some high-level guidance. The placeholders are included as {instructions}, where the previously constructed JSON instructions will go, and {history}, where past nodes and edges will be inserted – the format isn’t critical, but we’ll stick to the JSON schema we’re using for the output.

graph_creator_content = \
"""You are a brilliant and efficient creator of JSON objects that capture the essence of passages and who follows instructions unbelievably well.
You will be first given instructions and a json schema, then you will be provided a passage to extract the information from.
You will only respond with valid JSON, nothing else.
Your instructions are:
{instructions}
History:
{history}
"""

2.2 pass_passage_template

The human message portion of this template consists of something as simple as:

pass_passage_content = "Below is the passage to extract the values from.\n*****\nPassage:\n{passage}"

where {passage} is our placeholder for the chunk(s) of text we grab from our paragraphs.

2.3 Combining the Prompt Templates

To create our LangChain pipeline, we wrap the templates we created in the SystemMessagePromptTemplate and HumanMessagePromptTemplate classes, and then combine them into gen_template.

from langchain_core.prompts import (
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

graph_analyst_template = SystemMessagePromptTemplate.from_template(template=graph_creator_content,
                                                                   input_variables=['history', 'instructions'])
pass_passage_template = HumanMessagePromptTemplate.from_template(pass_passage_content, input_variables=['passage'])

gen_template = graph_analyst_template + pass_passage_template

gen_template.invoke({'passage': paragraphs[0]['paragraph_text'],
                     'history': '',
                     'instructions': json_prompt_instructions})
ChatPromptValue(messages=[SystemMessage(content='You are a brilliant and efficient creator of JSON objects that capture the essence of passages and who follows instructions unbelievably well.\nYou will be first given instructions and a json schema, then you will be provided a passage to extract the information from.\nYou will only respond with valid JSON, nothing else.\nYour instructions are:\nWe need to create a JSON object that contains a list of nodes and edges that connect the nodes.\nBoth, nodes and edges, have optional attributes.\nYour goal is to extract as much pertinent information from the passage as possible and create nodes and edges with the extracted information.\nIf history is provided, it will be in the JSON schema you are given. You may create new connections between the nodes and edges in the history and the new nodes you are producing.\nIf you wish to change/update any of the node attributes in the provided history based on newly gathered information, simply reuse the semantic_ids of the nodes you wish to change.\nIf you wish to modify/update the edge attributes in the history, reuse the semantic_ids of the \'from\' and \'to\' nodes of any edge you wish to change.\nUse the following schema and make sure to read the descriptions:\n{"$defs": {"Edge": {"properties": {"from_node": {"description": "The id of the node from which the edge originates. Only previously generated semantic_ids belong here, nothing else.", "title": "From Node", "type": "string"}, "to_node": {"description": "The id of the node to which the edge connects. Only previously generated semantic_ids belong here, nothing else.", "title": "To Node", "type": "string"}, "category": {"description": "The type of the relationship", "title": "Category", "type": "string"}, "attributes": {"anyOf": [{"additionalProperties": {"anyOf": [{"type": "string"}, {"type": "integer"}, {"type": "boolean"}]}, "type": "object"}, {"type": "null"}], "default": null, "description": "Additional properties of the edge", "title": "Attributes"}}, "required": ["from_node", "to_node", "category"], "title": "Edge", "type": "object"}, "Node": {"properties": {"semantic_id": {"description": "The unique identifier of the node that is                              a reference to create edges between different nodes.", "title": "Semantic Id", "type": "string"}, "category": {"description": "The category of the node", "title": "Category", "type": "string"}, "attributes": {"anyOf": [{"additionalProperties": {"anyOf": [{"type": "string"}, {"type": "integer"}, {"type": "boolean"}]}, "type": "object"}, {"type": "null"}], "default": null, "description": "Additional properties of the node", "title": "Attributes"}}, "required": ["semantic_id", "category"], "title": "Node", "type": "object"}}, "properties": {"nodes": {"description": "A list of nodes in the graph", "items": {"$ref": "#/$defs/Node"}, "title": "Nodes", "type": "array"}, "edges": {"description": "A list of edges in the graph", "items": {"$ref": "#/$defs/Edge"}, "title": "Edges", "type": "array"}}, "required": ["nodes", "edges"], "title": "Graph", "type": "object"}\n-----\n\nHistory:\n\n'), HumanMessage(content="Below is the passage to extract the values from.\n*****\nPassage:\nThe Commonwealth of the Philippines (; ) was the administrative body that governed the Philippines from 1935 to 1946, aside from a period of exile in the Second World War from 1942 to 1945 when Japan occupied the country. 
It replaced the Insular Government, a United States territorial government, and was established by the Tydings–McDuffie Act. The Commonwealth was designed as a transitional administration in preparation for the country's full achievement of independence.")])

invoke is a generic command in LangChain’s expression language (LCEL) that can be applied to many LangChain elements in order to ‘trigger’ them. This keeps the interface simple when building chains of elements, and you can extend the types of elements available to your custom chains by implementing your own classes that expose the invoke method (among others).
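
Anything that implements invoke can slot into a chain. For instance, a plain function wrapped in RunnableLambda (a toy example, not part of our pipeline) gains the same interface and can be piped just like the templates and the chat model:

from langchain_core.runnables import RunnableLambda

# Wrapping a function gives it .invoke and makes it pipeable with |
shout = RunnableLambda(lambda text: text.upper())
print(shout.invoke("fido ran over the bridge"))  # FIDO RAN OVER THE BRIDGE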

Generally, we could use partial_variables within the prompt templates so that we don’t have to pass in the json_prompt_instructions on each invocation – but a recent LangChain update (langchain == 0.1.16) did us wrong and broke that for quite a few templates.

3 Knowledge Graph Generation (Without History)

We now, more or less, have the components necessary to give knowledge graph generation a first go. Development is generally iterative so we’ll leave out the history aspect of it for the time being.

We’ll now take a gander at LCEL in action:

load_dotenv()

chat_model = ChatAnthropic(model_name='claude-3-haiku-20240307')
# json_output_parser = JsonOutputParser()

llm_pipe = gen_template | chat_model

That’s all there is to it. We pipe (|) the output from invoking gen_template straight into chat_model, which also gets invoked.

response = llm_pipe.invoke({'passage': paragraphs[0]['paragraph_text'],
                 'history': '',
                 'instructions': json_prompt_instructions})

llm_pipe is passed the same arguments that gen_template would’ve been.

pprint(response.content)
('{\n'
 '  "nodes": [\n'
 '    {\n'
 '      "semantic_id": "commonwealth_of_the_philippines",\n'
 '      "category": "administrative_body",\n'
 '      "attributes": {\n'
 '        "name": "Commonwealth of the Philippines",\n'
 '        "government_period": "1935 to 1946",\n'
 '        "purpose": "transitional administration in preparation for '
 'independence"\n'
 '      }\n'
 '    },\n'
 '    {\n'
 '      "semantic_id": "insular_government",\n'
 '      "category": "territorial_government",\n'
 '      "attributes": {\n'
 '        "name": "Insular Government",\n'
 '        "governed_by": "United States"\n'
 '      }\n'
 '    },\n'
 '    {\n'
 '      "semantic_id": "japan",\n'
 '      "category": "country",\n'
 '      "attributes": {\n'
 '        "name": "Japan",\n'
 '        "occupied_the_philippines": "1942 to 1945"\n'
 '      }\n'
 '    },\n'
 '    {\n'
 '      "semantic_id": "tydings_mcduffie_act",\n'
 '      "category": "legislation",\n'
 '      "attributes": {\n'
 '        "name": "Tydings–McDuffie Act",\n'
 '        "established": "Commonwealth of the Philippines"\n'
 '      }\n'
 '    }\n'
 '  ],\n'
 '  "edges": [\n'
 '    {\n'
 '      "from_node": "insular_government",\n'
 '      "to_node": "commonwealth_of_the_philippines",\n'
 '      "category": "replaced"\n'
 '    },\n'
 '    {\n'
 '      "from_node": "commonwealth_of_the_philippines",\n'
 '      "to_node": "japan",\n'
 '      "category": "occupied_by",\n'
 '      "attributes": {\n'
 '        "period": "1942 to 1945"\n'
 '      }\n'
 '    },\n'
 '    {\n'
 '      "from_node": "tydings_mcduffie_act",\n'
 '      "to_node": "commonwealth_of_the_philippines",\n'
 '      "category": "established"\n'
 '    }\n'
 '  ]\n'
 '}')

Would you look at that, it did what we told it to, and it cost less than a penny. However, it’s still a string, so we need to convert it into a more amenable format.

from langchain_core.output_parsers import JsonOutputParser

json_output_parser = JsonOutputParser()
json_output_parser.invoke(response)
{'nodes': [{'semantic_id': 'commonwealth_of_the_philippines',
   'category': 'administrative_body',
   'attributes': {'name': 'Commonwealth of the Philippines',
    'government_period': '1935 to 1946',
    'purpose': 'transitional administration in preparation for independence'}},
  {'semantic_id': 'insular_government',
   'category': 'territorial_government',
   'attributes': {'name': 'Insular Government',
    'governed_by': 'United States'}},
  {'semantic_id': 'japan',
   'category': 'country',
   'attributes': {'name': 'Japan',
    'occupied_the_philippines': '1942 to 1945'}},
  {'semantic_id': 'tydings_mcduffie_act',
   'category': 'legislation',
   'attributes': {'name': 'Tydings–McDuffie Act',
    'established': 'Commonwealth of the Philippines'}}],
 'edges': [{'from_node': 'insular_government',
   'to_node': 'commonwealth_of_the_philippines',
   'category': 'replaced'},
  {'from_node': 'commonwealth_of_the_philippines',
   'to_node': 'japan',
   'category': 'occupied_by',
   'attributes': {'period': '1942 to 1945'}},
  {'from_node': 'tydings_mcduffie_act',
   'to_node': 'commonwealth_of_the_philippines',
   'category': 'established'}]}

Using LangChain’s JsonOutputParser allows us to easily convert the JSON string into a Python dictionary. We’re once again calling invoke, which means it can easily be inserted into our pipeline:

llm_pipe = gen_template | chat_model | json_output_parser

Before assuming that the output would be a correctly structured JSON string, we needed to see it for ourselves. If the output from gen_template | chat_model ended up as anything other than a JSON string that our parser can handle, we would’ve received an unfortunate error.

Generally speaking, if you have a prompt that plays ball with an LLM of your choosing, you’re fairly safe when it comes to receiving the structured output in subsequent calls. However, it is a best practice to include a failsafe that can retry the process in the event of failure. The failsafe can be something as simple as sending the faulty output, along with a string that describes your desired output, back into the LLM for re-evaluation. For instance:

You didn't output the proper JSON format. Please try again.
This was your output:
{output}

We can skip that for now, and see how robust our pipeline really is. Risk it for the biscuit. 🙏
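
If you did want that failsafe, a minimal sketch (a hypothetical invoke_with_retry helper, catching parse errors broadly rather than naming a specific exception) could look like this:

def invoke_with_retry(inputs):
    # First attempt: run the prompt through the model, then try to parse
    raw = (gen_template | chat_model).invoke(inputs)
    try:
        return json_output_parser.invoke(raw)
    except Exception:  # e.g. a JSON decoding failure surfaced by the parser
        # Send the faulty output back once with a corrective instruction
        fix_msg = ("You didn't output the proper JSON format. Please try again.\n"
                   f"This was your output:\n{raw.content}")
        return json_output_parser.invoke(chat_model.invoke(fix_msg))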

3.1 Visualization with rustworkx

The easiest way to visualize our newly-formed knowledge graph is by using a network graph library; in our case, I’ve chosen rustworkx. It’s a Python library that allows for the creation, manipulation, and rendering of directed graphs. If you’re familiar with networkx, the syntax will be very similar; however, the performance is an order of magnitude faster, given that all of the internal goodies are written in Rust.

import rustworkx as rx
from rustworkx.visualization import mpl_draw

# Create a directed graph
digraph = rx.PyDiGraph()

# Add nodes to the graph
node_indices = {}
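# nodes_edges_json is the dict of nodes and edges parsed from the LLM output above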
for node in nodes_edges_json["nodes"]:
    idx = digraph.add_node(node)
    node_indices[node["semantic_id"]] = idx

# Add edges to the graph
for edge in nodes_edges_json["edges"]:
    from_idx = node_indices[edge["from_node"]]
    to_idx = node_indices[edge["to_node"]]
    digraph.add_edge(from_idx, to_idx, edge)

# Visualize the graph with labels based on node and edge categories
mpl_draw(digraph, with_labels=True,
         labels=lambda node: f'{node["category"]}\n{node.get("attributes", "")}',
         edge_labels=lambda edge: f'{edge["category"]}\n{edge.get("attributes", "")}',
         font_size=9)

It ain’t pretty, but it’s honest work.

To create the graph, we use a dictionary to map each node’s semantic_id to the node index that add_node returns when we create a new node. Then, to create the edges, that mapping is used to convert each semantic_id to its index.
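
As a small usage sketch (grounded in the nodes we just added), the payload attached to each node is the original dict, so the mapping lets us hop from a semantic_id back into graph-level queries:

# Look a node up by its semantic_id via our mapping
idx = node_indices["commonwealth_of_the_philippines"]
print(digraph[idx]["category"])   # the node payload is the original dict
print(digraph.out_degree(idx))    # number of edges leaving this node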

4 Knowledge Graph Generation (With History)

4.1 History Management

When it comes to managing the history of nodes and edges, there is a tiny bit of overhead involved. We need to:

  • Keep track of the generated nodes and edges and provide them with unique identifiers

  • Add new edges and nodes to the history

  • Update edges and nodes if the LLM makes changes to them

  • Return a string representation of the nodes and edges to our pipeline using a specified token limit dependent on the context size

To do this, we will create a magnificent GraphHistory class that manages this storage and retrieval.

(Unfolding the code not for the faint of heart)

Code
import logging
import logging.config
import param
from collections import OrderedDict
from copy import deepcopy
from uuid import uuid4
import json
from typing import Union, List, Dict

with open('../logs/logging_config.json', 'r') as f:
    config = json.load(f)
logging.config.dictConfig(config)
logger = logging.getLogger('root')

class GraphHistory(param.Parameterized):
    nodes_alias = param.String('nodes')
    edges_alias = param.String('edges')
    history = param.Dict(default=OrderedDict())
    latest_history = param.Dict(default=OrderedDict(),
        doc="Generated when get_history_str is run; contains {uuid: {nodes|edges: {object}} mapping. \
            Meant to be used for managing the current history window and modifications")
    latest_history_mapping = param.Dict(default=OrderedDict(),
        doc="Maps semantic_id to uuid for the latest history items as well as the node pairs to an edge uuid")
    token_max = param.Integer(default=400)
    
    def add_history(self, new_items: Union[List, Dict], return_with_uuid: bool = True):
        """
        Nodes are added directly to the history with their UUIDs. 
        Edges are added only after their 'from_node' and 'to_node' fields are set to the corresponding
        node UUIDs. This ensures that edges reference the correct nodes in the graph.
        """
        new_items = deepcopy(new_items)
        if isinstance(new_items, dict):
            new_items = [new_items]  # Ensure new_items is always a list for consistency
        history_list = []
        for item in new_items:
            item_type = self.nodes_alias if self.nodes_alias in item else self.edges_alias
            # Makes it easier to work with the inner dict of {nodes|edges: {*inner_dict*}}
            item_dict = item[item_type]
            if item_type == self.nodes_alias:  # Handling nodes
                # If node exists in latest_history, we want to modify it and move it to the bottom in history
                # No need to add to latest_history, since it won't be used since item exists in it already, and will be reset on next get_history_str call
                if item_dict['semantic_id'] in self.latest_history_mapping:
                    uuid_gen = self.latest_history_mapping[item_dict['semantic_id']]
                    self.history[uuid_gen] = item
                    self.history.move_to_end(uuid_gen)
                    logger.debug(f"Node exists in latest_history, moving to end of history: {uuid_gen}: {item}")
                else:
                    uuid_gen = uuid4()
                    self.history[uuid_gen] = item
                    logger.debug(f"Added node to history: {str(uuid_gen)}: {item}")
                    self.latest_history[uuid_gen] = item
                    self.latest_history_mapping[item_dict['semantic_id']] = uuid_gen
                if return_with_uuid:
                    history_list.append((str(uuid_gen), item))
                else:
                    history_list.append(item)
            else:  # Handling edges
                from_semantic_id = item_dict['from_node']
                to_semantic_id = item_dict['to_node']
                # Ensure 'from_node' and 'to_node' reference the correct UUIDs from the recently added nodes
                # TODO Add exception handling for when the LLM thinks that a node exists when it doesn't. Try, except
                try:
                    item_dict['from_node'] = self.latest_history_mapping[from_semantic_id]
                    item_dict['to_node'] = self.latest_history_mapping[to_semantic_id]
                except KeyError:
                    print(f"KeyError: {from_semantic_id} or {to_semantic_id} not found in latest_history_mapping")
                    continue
                # If the edge is in the latest history according to its from and to nodes, then we update it
                if (from_to_tuple:=(item_dict['from_node'], item_dict['to_node'])) in self.latest_history_mapping:
                    uuid_gen = self.latest_history_mapping[from_to_tuple]
                    self.history[uuid_gen] = item
                    self.history.move_to_end(uuid_gen)
                    logger.debug(f"Edge exists in latest_history_mapping, moving to end of history: {uuid_gen}: {item}")
                else:
                    uuid_gen = uuid4()
                    self.history[uuid_gen] = item
                    logger.debug(f"Added edge to history: {str(uuid_gen)}: {item}")
                if return_with_uuid:
                    history_list.append((str(uuid_gen), item))
                else:
                    history_list.append(item)
        return deepcopy(history_list[0]) if len(history_list) == 1 else deepcopy(history_list)
    
    def get_history_window(self, token_max: int = None):
        if token_max is None:
            token_max = self.token_max  # Use default token_max if not specified
        self.latest_history.clear()  # Clear the latest history for a fresh start
        self.latest_history_mapping.clear()  # Also clear the latest history mapping
        logger.debug(f"Cleared latest_history_mapping and latest_history")
        token_tracking = 0
        for uuid, item in reversed(self.history.items()):
            item_type = self.nodes_alias if self.nodes_alias in item else self.edges_alias
            item_dict = item[item_type]
            token_tracking += token_len(self._item_to_json_str(deepcopy(item)))
            if token_tracking < token_max:
                self.latest_history[uuid] = item  # Update latest_history with the current item
                if item_type == self.nodes_alias:
                    self.latest_history_mapping[item_dict['semantic_id']] = uuid  # Update latest_history_mapping
                    logger.debug(f"Added node to latest_history_mapping: {item_dict['semantic_id']}: {uuid}")
                elif item_type == self.edges_alias:
                    self.latest_history_mapping[(item_dict['from_node'], item_dict['to_node'])] = uuid
                    logger.debug(f"Added edge to latest_history_mapping: ({item_dict['from_node']}, {item_dict['to_node']}): {uuid}")
            else:
                break  # Stop adding items if token_max is reached
        return deepcopy(self.latest_history)  # Return a copy of the current history window

    def _item_to_json_str(self, item):
        item_type = self.nodes_alias if self.nodes_alias in item else self.edges_alias
        item_dict = item[item_type]
        if item_type == self.edges_alias:
            item_dict['from_node'] = self.history[item_dict['from_node']][self.nodes_alias]['semantic_id']
            item_dict['to_node'] = self.history[item_dict['to_node']][self.nodes_alias]['semantic_id']
        return json.dumps(item_dict)
            
    def get_history_str(self, token_max: int = None):
        """
        Returns a history string based on the token length specification and updates the latest_history
        """
        latest_history = self.get_history_window(token_max)
        json_list = []
        for uuid, item in latest_history.items():
            json_list.append(self._item_to_json_str(item))
        if json_list:
            json_str = "\n".join(json_list)
            logger.debug(f"JSON History string created: {json_str}")
            return json_str

The code above addresses the bullets representing our requirements; however, there is a messy workaround where we skip creating edges when the LLM imagines node names that don’t exist in the history. The ideal handling of this would involve rerunning the generation and feeding it the error, but we’re going to wing it and skip that, for better or worse. The mistaken identity shouldn’t be very common, but it can occur.

5 Putting it All Together

Now that we have the ability to store and inject history into our pipeline, we’re ready to go.

We’re iterating over all of the paragraphs, and then splitting each paragraph with the RecursiveCharacterTextSplitter.

Some things to note about our new pipeline before you dive in:

  1. The JSON parser is now wrapped with a special OutputFixingParser class from LangChain that, in the event of an error like a JSONDecodeError, sends that error back to the LLM and tries to generate the correct format. Experimenting with Claude 3 Haiku led me to add this, as it had generated faulty JSON (unlike GPT 3.5). This gives more credence to the user stories claiming that Claude 3 is more buddy-buddy with XML than with JSON.
  2. A way to handle the RateLimitError exception was added, in the event that the API complains when we generate too many nodes and edges back to back. All it takes is waiting a minute before retrying.
  3. The paragraph_idx is added to the nodes to indicate which paragraph it was generated from.
  4. The nodes and edges we generate are stored in graph_history, which holds objects similar to what we generated here, but keyed by UUIDs for unique identification (two generated semantic_ids may, by chance alone, be the same)
from json import JSONDecodeError
import time

from anthropic import RateLimitError
from langchain.output_parsers import OutputFixingParser

splitter = RecursiveCharacterTextSplitter(chunk_size=70, chunk_overlap=20, length_function=token_len)

json_fixing_parser = OutputFixingParser.from_llm(parser=json_output_parser, llm=chat_model, max_retries=3)

llm_pipe = gen_template | chat_model | json_fixing_parser
graph_history = GraphHistory()

graph_components = []
for paragraph in paragraphs:
    splits = splitter.split_text(paragraph['paragraph_text'])
    for split in splits:
        local_history = graph_history.get_history_str(token_max=600)
        try:
            response = llm_pipe.invoke(
                {'passage': split,
                'history': local_history,
                'instructions': json_prompt_instructions})
        except RateLimitError as e:
            print(e)
            time.sleep(60)  # Wait for a minute
            response = llm_pipe.invoke(
                {'passage': split,
                'history': local_history,
                'instructions': json_prompt_instructions})
        if 'nodes' in response:
            for node in response['nodes']:
                if 'semantic_id' not in node:
                    continue
                node['paragraph_idx'] = paragraph['idx']
                graph_history.add_history({'nodes': node}, return_with_uuid=False)
        if 'edges' in response:
            for edge in response['edges']:
                graph_history.add_history({'edges': edge}, return_with_uuid=False)

        graph_components.append(response)
KeyError: silver-lake or minnewawa-brook not found in latest_history_mapping
KeyError: minnewawa-brook or the-branch not found in latest_history_mapping
KeyError: the-branch or ashuelot-river not found in latest_history_mapping
KeyError: ashuelot-river or connecticut-river not found in latest_history_mapping
KeyError: veteran-banker or treasury not found in latest_history_mapping
graph_history.history
OrderedDict([(UUID('acc73dc0-d5ae-499e-8cc4-63f70f2d935f'),
              {'nodes': {'semantic_id': 'insular_government',
                'category': 'political_entity',
                'attributes': {'name': 'Insular Government',
                 'description': 'A United States territorial government that was replaced by the Commonwealth of the Philippines.'}}}),
             (UUID('38d26bc0-e096-4524-945a-77b9e4ae0f49'),
              {'nodes': {'semantic_id': 'commonwealth_of_the_philippines',
                'category': 'political_entity',
                'attributes': {'name': 'Commonwealth of the Philippines',
                 'years_active': '1935 to 1946',
                 'description': 'The administrative body that governed the Philippines during this period, except for a period of exile from 1942 to 1945 when Japan occupied the country.'}}}),
             (UUID('9555e806-1679-4bc0-99d7-55717f21bdef'),
              {'nodes': {'semantic_id': 'tydings_mcduffie_act',
                'category': 'legal_document',
                'attributes': {'name': 'Tydings–McDuffie Act',
                 'description': "The act that established the Commonwealth of the Philippines as a transitional administration in preparation for the country's full achievement of independence."}}}),
             (UUID('90d9b404-119e-4b31-a0cf-ad102105687f'),
              {'edges': {'from_node': UUID('acc73dc0-d5ae-499e-8cc4-63f70f2d935f'),
                'to_node': UUID('38d26bc0-e096-4524-945a-77b9e4ae0f49'),
                'category': 'replaced'}}),
             (UUID('2cdca930-c9f5-4d65-8da0-4fcc351ac2d0'),
              {'edges': {'from_node': UUID('9555e806-1679-4bc0-99d7-55717f21bdef'),
                'to_node': UUID('38d26bc0-e096-4524-945a-77b9e4ae0f49'),
                'category': 'established'}}),
             (UUID('c7120f30-4152-4e88-bec1-698bfdd2d5e1'),
              {'nodes': {'semantic_id': 'lake_oesa',
                'category': 'natural_feature',
                'attributes': {'name': 'Lake Oesa',
                 'elevation': 2267,
                 'elevation_unit': 'm',
                 'location': {'park': 'Yoho National Park',
                  'city': 'Field',
                  'province': 'British Columbia',
                  'country': 'Canada'}}}}),
             (UUID('7c83cf46-05fc-491d-9667-20acf68fe70f'),
              {'nodes': {'semantic_id': 'arafura_swamp',
                'category': 'natural_feature',
                'attributes': {'name': 'Arafura Swamp',
                 'type': 'inland freshwater wetland',
                 'location': {'region': 'Arnhem Land',
                  'territory': 'Northern Territory',
                  'country': 'Australia'},
                 'size': {'area': {'value': None, 'unit': 'km2'},
                  'expansion_during_wet_season': True},
                 'description': 'a near pristine floodplain, possibly the largest wooded swamp in the Northern Territory and Australia',
                 'cultural_significance': 'of great cultural significance to the Yolngu people, in particular the Ramingining community',
                 'filming_location': 'Ten Canoes'}}}),
             (UUID('f39070c7-1d59-4d1b-a4a4-c8a18c222f85'),
              {'nodes': {'semantic_id': 'wapizagonke_lake',
                'category': 'natural_feature',
                'attributes': {'name': 'Wapizagonke Lake',
                 'type': 'body of water',
                 'location': {'sector': 'Lac-Wapizagonke',
                  'city': 'Shawinigan',
                  'park': 'La Mauricie National Park',
                  'region': 'Mauricie',
                  'province': 'Quebec',
                  'country': 'Canada'}}}}),
             (UUID('9a8a31e6-d311-4085-845d-48ae33707b51'),
              {'nodes': {'semantic_id': 'amursky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Amursky District',
                 'country': 'Russia',
                 'region': 'Khabarovsk Krai'}}}),
             (UUID('f85fb9ae-7e0f-46b2-b039-63ee01e6ce5d'),
              {'nodes': {'semantic_id': 'khabarovsky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Khabarovsky District',
                 'country': 'Russia',
                 'region': 'Khabarovsk Krai',
                 'area': None,
                 'area_unit': None,
                 'administrative_center': 'Khabarovsk'}}}),
             (UUID('21425234-233f-4f40-b6b3-98e818755151'),
              {'edges': {'from_node': UUID('f85fb9ae-7e0f-46b2-b039-63ee01e6ce5d'),
                'to_node': UUID('9a8a31e6-d311-4085-845d-48ae33707b51'),
                'category': 'separated_by'}}),
             (UUID('43464202-f216-469b-94d2-8ca7ad3d92f1'),
              {'nodes': {'semantic_id': 'silver_lake',
                'category': 'natural_feature',
                'attributes': {'name': 'Silver Lake',
                 'type': 'body of water',
                 'location': {'county': 'Cheshire County',
                  'state': 'New Hampshire',
                  'country': 'United States',
                  'towns': ['Harrisville', 'Nelson']},
                 'outflows': ['Minnewawa Brook', 'The Branch'],
                 'ultimate_recipient': 'Connecticut River'}}}),
             (UUID('f938374c-ec1e-49b0-b049-91257d6ae64d'),
              {'nodes': {'semantic_id': 'hyderabad_police_area',
                'category': 'administrative_district',
                'attributes': {'name': 'Hyderabad Police area',
                 'jurisdiction_size': 'smallest'}}}),
             (UUID('595a2f04-abae-4845-b45c-6df6a8ed9ab5'),
              {'nodes': {'semantic_id': 'hyderabad_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Hyderabad district',
                 'jurisdiction_size': 'second_smallest'}}}),
             (UUID('b3beb831-6954-4ca9-87f3-052767e60856'),
              {'edges': {'from_node': UUID('f938374c-ec1e-49b0-b049-91257d6ae64d'),
                'to_node': UUID('595a2f04-abae-4845-b45c-6df6a8ed9ab5'),
                'category': 'jurisdiction_size_hierarchy'}}),
             (UUID('252323ba-7590-4b55-ae43-e6b9da12ff9e'),
              {'edges': {'from_node': UUID('595a2f04-abae-4845-b45c-6df6a8ed9ab5'),
                'to_node': UUID('631d3937-3f47-4598-8f45-bdb90d5eb91f'),
                'category': 'jurisdiction_size_hierarchy'}}),
             (UUID('723c0ccc-8f09-43ad-bbcf-97b08d5c1bd8'),
              {'edges': {'from_node': UUID('631d3937-3f47-4598-8f45-bdb90d5eb91f'),
                'to_node': UUID('4c681f75-6f72-4771-831a-4aa16149195a'),
                'category': 'jurisdiction_size_hierarchy'}}),
             (UUID('4c681f75-6f72-4771-831a-4aa16149195a'),
              {'nodes': {'semantic_id': 'hmda_area',
                'category': 'administrative_district',
                'attributes': {'name': 'Hyderabad Metropolitan Development Authority (HMDA) area',
                 'jurisdiction_size': 'largest',
                 'type': 'urban_planning_agency',
                 'apolitical': True,
                 'covers': ['ghmc_area', 'suburbs_of_ghmc_area']}}}),
             (UUID('631d3937-3f47-4598-8f45-bdb90d5eb91f'),
              {'nodes': {'semantic_id': 'ghmc_area',
                'category': 'administrative_district',
                'attributes': {'name': 'GHMC area',
                 'jurisdiction_size': 'second_largest',
                 'alternate_name': 'Hyderabad city'}}}),
             (UUID('90c5ce53-79aa-4aaf-a27d-7400e6ac1c08'),
              {'nodes': {'semantic_id': 'suburbs_of_ghmc_area',
                'category': 'administrative_district',
                'attributes': {'name': 'Suburbs of GHMC area',
                 'jurisdiction_size': 'medium',
                 'type': 'residential'}}}),
             (UUID('9a4b69f8-749a-4b7c-a3e3-e2db4f3823d1'),
              {'nodes': {'semantic_id': 'hmwssb',
                'category': 'administrative_body',
                'attributes': {'name': 'Hyderabad Metropolitan Water Supply and Sewerage Board',
                 'type': 'water_management'}}}),
             (UUID('c48817dd-b382-4db9-ad18-d3e2190cdcf5'),
              {'edges': {'from_node': UUID('4c681f75-6f72-4771-831a-4aa16149195a'),
                'to_node': UUID('631d3937-3f47-4598-8f45-bdb90d5eb91f'),
                'category': 'jurisdiction_size_hierarchy'}}),
             (UUID('4b64fe19-941d-44ce-8cd0-8f94f503dea5'),
              {'edges': {'from_node': UUID('4c681f75-6f72-4771-831a-4aa16149195a'),
                'to_node': UUID('90c5ce53-79aa-4aaf-a27d-7400e6ac1c08'),
                'category': 'jurisdiction_size_hierarchy'}}),
             (UUID('268f7e8f-a2c7-4a83-94b9-24baeca1be73'),
              {'edges': {'from_node': UUID('4c681f75-6f72-4771-831a-4aa16149195a'),
                'to_node': UUID('9a4b69f8-749a-4b7c-a3e3-e2db4f3823d1'),
                'category': 'manages'}}),
             (UUID('7956f84b-20a8-4836-ae7a-c7311d716cd1'),
              {'nodes': {'semantic_id': 'san_juan',
                'category': 'city',
                'attributes': {'name': 'San Juan',
                 'location': {'country': 'Puerto Rico',
                  'region': 'north-eastern coast'},
                 'borders': {'north': 'Atlantic Ocean',
                  'south': ['Caguas', 'Trujillo Alto'],
                  'east': ['Carolina'],
                  'west': ['Guaynabo']},
                 'area': {'value': 76.93, 'unit': 'square miles'},
                 'water_bodies': ['San Juan Bay',
                  'Condado Lagoon',
                  'San José Lagoon'],
                 'water_area': {'value': 29.11,
                  'unit': 'square miles',
                  'percentage': 37.83}}}}),
             (UUID('75abfc38-a7ea-4db7-9c2b-5dddaf51c493'),
              {'nodes': {'semantic_id': 'urban_hinterland',
                'category': 'administrative_district',
                'attributes': {'name': 'Urban hinterland',
                 'type': 'urban_area'}}}),
             (UUID('6f12633f-2502-4467-a29c-ebe5b0699810'),
              {'nodes': {'semantic_id': 'kreisfreie_stadte',
                'category': 'administrative_district',
                'attributes': {'name': 'Kreisfreie Städte',
                 'type': 'district-free_city_or_town'}}}),
             (UUID('e7748608-55fe-4ad2-b17a-83af70f4fc73'),
              {'nodes': {'semantic_id': 'landkreise_amalgamation',
                'category': 'administrative_district',
                'attributes': {'name': 'Local associations of a special kind',
                 'type': 'amalgamation_of_districts',
                 'purpose': 'simplification_of_administration'}}}),
             (UUID('0114478c-292d-43e4-940c-349f0b8d8060'),
              {'edges': {'from_node': UUID('6f12633f-2502-4467-a29c-ebe5b0699810'),
                'to_node': UUID('75abfc38-a7ea-4db7-9c2b-5dddaf51c493'),
                'category': 'grouping'}}),
             (UUID('9c504b5a-98b9-4620-9d56-0c18fe35005f'),
              {'edges': {'from_node': UUID('e7748608-55fe-4ad2-b17a-83af70f4fc73'),
                'to_node': UUID('75abfc38-a7ea-4db7-9c2b-5dddaf51c493'),
                'category': 'comprises'}}),
             (UUID('6b27c2fc-9694-4f6e-b61c-425360f1c8f7'),
              {'edges': {'from_node': UUID('e7748608-55fe-4ad2-b17a-83af70f4fc73'),
                'to_node': UUID('6f12633f-2502-4467-a29c-ebe5b0699810'),
                'category': 'comprises'}}),
             (UUID('7715b916-5807-45f4-8408-2770897a7581'),
              {'nodes': {'semantic_id': 'norfolk_island',
                'category': 'island',
                'attributes': {'name': 'Norfolk Island',
                 'location': {'ocean': 'South Pacific Ocean',
                  'relative_location': 'east of Australian mainland'},
                 'coordinates': {'latitude': -29.033, 'longitude': 167.95},
                 'area': {'value': 34.6, 'unit': 'square kilometres'},
                 'coastline': {'length': 32, 'unit': 'km'},
                 'highest_point': 'Mount Bates'}}}),
             (UUID('0db9008a-ae7e-4e32-a7b3-ae5c7d8f93d2'),
              {'edges': {'from_node': UUID('7715b916-5807-45f4-8408-2770897a7581'),
                'to_node': UUID('48a3d3e7-34c1-4a64-93ba-72a6108b3e57'),
                'category': 'part_of'}}),
             (UUID('48a3d3e7-34c1-4a64-93ba-72a6108b3e57'),
              {'nodes': {'semantic_id': 'phillip_island',
                'category': 'island',
                'attributes': {'name': 'Phillip Island',
                 'location': {'relation': 'second largest island of the territory',
                  'coordinates': {'latitude': -29.117, 'longitude': 167.95},
                  'distance_from_main_island': {'value': 7,
                   'unit': 'kilometres'}}}}}),
             (UUID('52ec31ee-9fc6-42a1-9e6c-daf0ea0a9390'),
              {'edges': {'from_node': UUID('48a3d3e7-34c1-4a64-93ba-72a6108b3e57'),
                'to_node': UUID('7715b916-5807-45f4-8408-2770897a7581'),
                'category': 'part_of'}}),
             (UUID('08f207c1-6915-4237-ac4e-902815d9cfae'),
              {'nodes': {'semantic_id': 'star_stadium',
                'category': 'stadium',
                'attributes': {'name': 'Star (Zvezda) Stadium',
                 'former_name': 'Lenin Komsomol Stadium',
                 'location': {'city': 'Perm', 'country': 'Russia'},
                 'usage': 'football matches',
                 'home_team': 'FC Amkar Perm',
                 'capacity': 17000,
                 'opened': '1969-06-05'}}}),
             (UUID('5be79bf7-cd2a-487f-8833-36ae11257df8'),
              {'nodes': {'semantic_id': 'perm',
                'category': 'city',
                'attributes': {'name': 'Perm',
                 'location': {'river': 'Kama River',
                  'region': 'Perm Krai',
                  'country': 'Russia',
                  'geography': 'European part of Russia near the Ural Mountains'},
                 'administrative_status': 'administrative centre'}}}),
             (UUID('e9a848ba-35b3-42e4-b9e6-aa0ea8651d92'),
              {'edges': {'from_node': UUID('08f207c1-6915-4237-ac4e-902815d9cfae'),
                'to_node': UUID('5be79bf7-cd2a-487f-8833-36ae11257df8'),
                'category': 'located_in'}}),
             (UUID('57c8adca-9dcc-4257-be85-bfb48eacd310'),
              {'nodes': {'semantic_id': 'paea',
                'category': 'municipality',
                'attributes': {'name': 'Paea',
                 'location': {'island': 'Tahiti',
                  'subdivision': 'Windward Islands',
                  'region': 'Society Islands',
                  'country': 'French Polynesia',
                  'territory': 'France'},
                 'population': 13021}}}),
             (UUID('62866902-6285-4c38-98b0-7496dbe73fd3'),
              {'nodes': {'semantic_id': 'potamogeton_amplifolius',
                'category': 'plant',
                'attributes': {'common_names': ['largeleaf pondweed',
                  'broad-leaved pondweed'],
                 'habitat': ['lakes', 'ponds', 'rivers'],
                 'water_depth': 'often in deep water',
                 'distribution': 'North America'}}}),
             (UUID('0da6e66c-64c6-4155-bb6f-88e1b0c9a349'),
              {'nodes': {'semantic_id': 'soltonsky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Soltonsky District',
                 'location': {'region': 'Altai Krai', 'country': 'Russia'}}}}),
             (UUID('2bdd9668-a653-4815-bcde-f43dcf5ff4a5'),
              {'nodes': {'semantic_id': 'krasnogorsky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Krasnogorsky District',
                 'location': {'region': 'Altai Krai', 'country': 'Russia'}}}}),
             (UUID('e9620276-009f-4f5a-99b6-b41d93fbe791'),
              {'nodes': {'semantic_id': 'sovetsky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Sovetsky District',
                 'location': {'region': 'Altai Krai', 'country': 'Russia'}}}}),
             (UUID('4bdbb329-0688-4621-9c67-e1af0e8d57fe'),
              {'nodes': {'semantic_id': 'smolensky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Smolensky District',
                 'location': {'region': 'Altai Krai', 'country': 'Russia'}}}}),
             (UUID('d3fca76e-6cb9-47b7-8fdf-282cd0de4bee'),
              {'nodes': {'semantic_id': 'biysky_district',
                'category': 'administrative_district',
                'attributes': {'name': 'Biysky District',
                 'location': {'region': 'Altai Krai',
                  'country': 'Russia',
                  'geography': 'east of the krai'},
                 'administrative_status': 'administrative and municipal district (raion)',
                 'bordering_districts': ['Soltonsky_district',
                  'Krasnogorsky_district',
                  'Sovetsky_district',
                  'Smolensky_district',
                  'City_of_Biysk']}}}),
             (UUID('a375ae9a-9282-4916-9e6b-02da8a824e3f'),
              {'nodes': {'semantic_id': 'city_of_biysk',
                'category': 'city',
                'attributes': {'name': 'Biysk',
                 'location': {'region': 'Altai Krai', 'country': 'Russia'},
                 'administrative_status': 'administrative center'}}}),
             (UUID('6674891e-b1e6-4c85-9821-54b4a7fd923a'),
              {'edges': {'from_node': UUID('d3fca76e-6cb9-47b7-8fdf-282cd0de4bee'),
                'to_node': UUID('0da6e66c-64c6-4155-bb6f-88e1b0c9a349'),
                'category': 'bordering'}}),
             (UUID('5f81b958-60c6-40a5-8bea-9fdacae9f671'),
              {'edges': {'from_node': UUID('d3fca76e-6cb9-47b7-8fdf-282cd0de4bee'),
                'to_node': UUID('2bdd9668-a653-4815-bcde-f43dcf5ff4a5'),
                'category': 'bordering'}}),
             (UUID('8077d42e-5e70-44dd-aade-c764087f139f'),
              {'edges': {'from_node': UUID('d3fca76e-6cb9-47b7-8fdf-282cd0de4bee'),
                'to_node': UUID('e9620276-009f-4f5a-99b6-b41d93fbe791'),
                'category': 'bordering'}}),
             (UUID('abe4eb3e-0a01-48f2-b97d-67a30519a4d3'),
              {'edges': {'from_node': UUID('d3fca76e-6cb9-47b7-8fdf-282cd0de4bee'),
                'to_node': UUID('4bdbb329-0688-4621-9c67-e1af0e8d57fe'),
                'category': 'bordering'}}),
             (UUID('a27a6ceb-536d-4dbe-9798-e5d86e9755c6'),
              {'edges': {'from_node': UUID('d3fca76e-6cb9-47b7-8fdf-282cd0de4bee'),
                'to_node': UUID('a375ae9a-9282-4916-9e6b-02da8a824e3f'),
                'category': 'bordering'}}),
             (UUID('3d2af122-d4b9-47f1-a034-c9f23e262e14'),
              {'nodes': {'semantic_id': 'contoocook_lake',
                'category': 'lake',
                'attributes': {'name': 'Contoocook Lake',
                 'location': {'county': 'Cheshire County',
                  'state': 'New Hampshire',
                  'country': 'United States',
                  'towns': ['Jaffrey', 'Rindge']},
                 'connection': {'to': 'pool_pond',
                  'type': 'forms_headwaters_of'},
                 'outflow': {'to': 'contoocook_river', 'direction': 'north'},
                 'outflow_destination': 'merrimack_river'}}}),
             (UUID('1eafa7a8-a830-4472-8b32-c071159c8140'),
              {'nodes': {'semantic_id': 'pool_pond',
                'category': 'lake',
                'attributes': {'name': 'Pool Pond',
                 'connection': {'to': 'contoocook_lake',
                  'type': 'forms_headwaters_of'}}}}),
             (UUID('f841df5f-a4ff-4a6b-8656-1458252aca37'),
              {'edges': {'from_node': UUID('3d2af122-d4b9-47f1-a034-c9f23e262e14'),
                'to_node': UUID('1eafa7a8-a830-4472-8b32-c071159c8140'),
                'category': 'forms_headwaters_of'}}),
             (UUID('d50d66d9-c89d-4c0a-9c61-fa4e856ab2c2'),
              {'edges': {'from_node': UUID('3d2af122-d4b9-47f1-a034-c9f23e262e14'),
                'to_node': UUID('56ae1a37-74f4-486b-b517-34b99027ba36'),
                'category': 'outflows_to'}}),
             (UUID('56ae1a37-74f4-486b-b517-34b99027ba36'),
              {'nodes': {'semantic_id': 'contoocook_river',
                'category': 'river',
                'attributes': {'name': 'Contoocook River',
                 'flow_direction': 'north',
                 'outflow_destination': 'merrimack_river'}}}),
             (UUID('0c92123d-fbc3-47e7-b752-65df1d6680c0'),
              {'nodes': {'semantic_id': 'merrimack_river',
                'category': 'river',
                'attributes': {'name': 'Merrimack River',
                 'location': {'city': 'Penacook',
                  'state': 'New Hampshire',
                  'country': 'United States'}}}}),
             (UUID('6e9b6a2d-926e-4b20-97ab-d8bb217a5029'),
              {'edges': {'from_node': UUID('56ae1a37-74f4-486b-b517-34b99027ba36'),
                'to_node': UUID('0c92123d-fbc3-47e7-b752-65df1d6680c0'),
                'category': 'flows_into'}}),
             (UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
              {'nodes': {'semantic_id': 'bogota',
                'category': 'city',
                'attributes': {'name': 'Bogotá',
                 'pronunciation': {'english': ['/ˈboʊɡəˌtɑː/',
                   '/ˌboʊ-/',
                   '/ˈbɔɪ-/'],
                  'spanish': 'boˈɣota'},
                 'official_name': 'Bogotá',
                 'administration': 'Capital District'}}}),
             (UUID('adc293cf-401b-49b6-928d-ada04da0e4bf'),
              {'nodes': {'semantic_id': 'political_center',
                'category': 'function',
                'attributes': {'name': 'political center',
                 'location': 'Bogotá'}}}),
             (UUID('108added-3b0f-4b67-9e28-fce89a168e46'),
              {'nodes': {'semantic_id': 'economic_center',
                'category': 'function',
                'attributes': {'name': 'economic center',
                 'location': 'Bogotá'}}}),
             (UUID('7fad3f22-d955-4f54-a1e2-5683bf4639e8'),
              {'nodes': {'semantic_id': 'administrative_center',
                'category': 'function',
                'attributes': {'name': 'administrative center',
                 'location': 'Bogotá'}}}),
             (UUID('404e8a42-e9e9-4081-9890-669ba48d1b36'),
              {'nodes': {'semantic_id': 'industrial_center',
                'category': 'function',
                'attributes': {'name': 'industrial center',
                 'location': 'Bogotá'}}}),
             (UUID('8b80ba4f-77b2-4ee6-8548-a66108963fb7'),
              {'nodes': {'semantic_id': 'artistic_center',
                'category': 'function',
                'attributes': {'name': 'artistic center',
                 'location': 'Bogotá'}}}),
             (UUID('056f1544-f292-4506-ba8d-18d6e294d433'),
              {'nodes': {'semantic_id': 'cultural_center',
                'category': 'function',
                'attributes': {'name': 'cultural center',
                 'location': 'Bogotá'}}}),
             (UUID('d747fcf6-5b60-4595-a127-f3e247cbb8a3'),
              {'nodes': {'semantic_id': 'sports_center',
                'category': 'function',
                'attributes': {'name': 'sports center',
                 'location': 'Bogotá'}}}),
             (UUID('59719cce-5c68-47cf-8111-481c74e73c3b'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('8ad7ae8e-ed3a-415f-8d0e-a8984bd7717e'),
                'category': 'capital_of'}}),
             (UUID('0bc7053b-2183-41df-995a-faecd919ad45'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('5576789f-9306-4a82-8736-882db81abdf2'),
                'category': 'part_of'}}),
             (UUID('73230839-050b-4ace-9dd1-7927b1bd1034'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('adc293cf-401b-49b6-928d-ada04da0e4bf'),
                'category': 'functions_as'}}),
             (UUID('04b8ded0-c2c1-441f-b7f7-069b05f80338'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('108added-3b0f-4b67-9e28-fce89a168e46'),
                'category': 'functions_as'}}),
             (UUID('0a86d51a-bd57-4a63-b494-08075a2bcb4a'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('7fad3f22-d955-4f54-a1e2-5683bf4639e8'),
                'category': 'functions_as'}}),
             (UUID('58c0a891-5c05-48e5-ba2a-1c230b588997'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('404e8a42-e9e9-4081-9890-669ba48d1b36'),
                'category': 'functions_as'}}),
             (UUID('5bad3267-dceb-4067-8d02-bc8c748d50d3'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('8b80ba4f-77b2-4ee6-8548-a66108963fb7'),
                'category': 'functions_as'}}),
             (UUID('d80bed21-cd68-454d-a142-223fb30d5db6'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('056f1544-f292-4506-ba8d-18d6e294d433'),
                'category': 'functions_as'}}),
             (UUID('6c12a6ad-244d-4d7e-8c25-f3a6288967fb'),
              {'edges': {'from_node': UUID('2de55d21-12b4-493e-954a-acf0b7bf4ac2'),
                'to_node': UUID('d747fcf6-5b60-4595-a127-f3e247cbb8a3'),
                'category': 'functions_as'}}),
             (UUID('87517804-e5a2-44d9-82bd-23b4e33c2a40'),
              {'nodes': {'semantic_id': 'bogota',
                'category': 'city',
                'attributes': {'name': 'Bogotá',
                 'functions': ['political center',
                  'economic center',
                  'administrative center',
                  'industrial center',
                  'artistic center',
                  'cultural center',
                  'sports center']}}}),
             (UUID('8ad7ae8e-ed3a-415f-8d0e-a8984bd7717e'),
              {'nodes': {'semantic_id': 'colombia',
                'category': 'country',
                'attributes': {'name': 'Colombia',
                 'capital': 'Bogotá',
                 'status': 'capital and largest city'}}}),
             (UUID('5576789f-9306-4a82-8736-882db81abdf2'),
              {'nodes': {'semantic_id': 'cundinamarca',
                'category': 'region',
                'attributes': {'name': 'Cundinamarca',
                 'relation_to_bogota': 'often thought of as part of'}}}),
             (UUID('af6b4ebf-7c7c-4dd0-a58e-ab8c292eaf7c'),
              {'edges': {'from_node': UUID('87517804-e5a2-44d9-82bd-23b4e33c2a40'),
                'to_node': UUID('8ad7ae8e-ed3a-415f-8d0e-a8984bd7717e'),
                'category': 'capital_of'}}),
             (UUID('3265632a-4065-46c4-82cf-527b3d4abc13'),
              {'edges': {'from_node': UUID('87517804-e5a2-44d9-82bd-23b4e33c2a40'),
                'to_node': UUID('5576789f-9306-4a82-8736-882db81abdf2'),
                'category': 'part_of'}}),
             (UUID('d97d057d-2564-427d-9703-e77a61ff58c7'),
              {'nodes': {'semantic_id': 'intracellular_fluid',
                'category': 'fluid',
                'attributes': {'name': 'intracellular fluid',
                 'volume': '2/3 of body water',
                 'amount_in_72_kg_body': '25 litres',
                 'percentage_of_total_body_fluid': 62.5}}}),
             (UUID('380f506f-c2cf-453e-879f-fb58b3f3d1db'),
              {'nodes': {'semantic_id': 'body_fluid',
                'category': 'fluid',
                'attributes': {'name': 'body fluid',
                 'total_volume_in_72_kg_body': '40 litres'}}}),
             (UUID('1b0bebc1-49a1-419f-bfb2-d50cffeed740'),
              {'edges': {'from_node': UUID('d97d057d-2564-427d-9703-e77a61ff58c7'),
                'to_node': UUID('380f506f-c2cf-453e-879f-fb58b3f3d1db'),
                'category': 'part_of'}}),
             (UUID('6e5f08dc-fe95-4b74-884d-dcce8470290a'),
              {'nodes': {'semantic_id': 'territorial_waters',
                'category': 'geographic_area',
                'attributes': {'name': 'territorial waters',
                 'definition': 'a belt of coastal waters extending at most 12 nautical miles (22.2 km; 13.8 mi) from the baseline (usually the mean low-water mark) of a coastal state',
                 'source': '1982 United Nations Convention on the Law of the Sea'}}}),
             (UUID('443e77c0-cff1-43e4-89f2-ba748d4421a1'),
              {'edges': {'from_node': UUID('6e5f08dc-fe95-4b74-884d-dcce8470290a'),
                'to_node': UUID('d5019515-a9d9-4d23-89cf-dac81f7d96ea'),
                'category': 'extends_from'}}),
             (UUID('58b64eae-e6c2-47f1-8916-eaf1dcc87e6b'),
              {'edges': {'from_node': UUID('6e5f08dc-fe95-4b74-884d-dcce8470290a'),
                'to_node': UUID('11903999-15a7-4776-8bae-f1803429147f'),
                'category': 'belongs_to'}}),
             (UUID('7b534168-d2e7-498e-9115-5e21d6c638f3'),
              {'nodes': {'semantic_id': 'territorial_sea',
                'category': 'geographic_area',
                'attributes': {'name': 'territorial sea',
                 'definition': 'a belt of coastal waters extending at most 12 nautical miles (22.2 km; 13.8 mi) from the baseline (usually the mean low-water mark) of a coastal state',
                 'sovereign_territory': True,
                 'foreign_ship_passage': 'innocent passage through it or transit passage for straits',
                 'jurisdiction': 'extends to airspace over and seabed below'}}}),
             (UUID('d5019515-a9d9-4d23-89cf-dac81f7d96ea'),
              {'nodes': {'semantic_id': 'baseline',
                'category': 'geographic_feature',
                'attributes': {'name': 'baseline',
                 'definition': 'usually the mean low-water mark of a coastal state'}}}),
             (UUID('11903999-15a7-4776-8bae-f1803429147f'),
              {'nodes': {'semantic_id': 'coastal_state',
                'category': 'legal_entity',
                'attributes': {'name': 'coastal state'}}}),
             (UUID('7a93cd6e-7f97-43fa-9054-22d3e4478ecf'),
              {'edges': {'from_node': UUID('7b534168-d2e7-498e-9115-5e21d6c638f3'),
                'to_node': UUID('11903999-15a7-4776-8bae-f1803429147f'),
                'category': 'belongs_to'}}),
             (UUID('4dec6036-1299-4093-ae68-3b6cecc73053'),
              {'edges': {'from_node': UUID('7b534168-d2e7-498e-9115-5e21d6c638f3'),
                'to_node': UUID('d5019515-a9d9-4d23-89cf-dac81f7d96ea'),
                'category': 'extends_from'}}),
             (UUID('d86a7f75-df06-47f1-a30d-67a921d822bf'),
              {'nodes': {'semantic_id': 'strait',
                'category': 'geographic_feature',
                'attributes': {'name': 'strait',
                 'sovereign_territory': True,
                 'jurisdiction': {'airspace': True, 'seabed': True}}}}),
             (UUID('1baf2bf2-6083-481f-8980-d2f9793f58e6'),
              {'nodes': {'semantic_id': 'maritime_delimitation',
                'category': 'legal_concept',
                'attributes': {'name': 'maritime delimitation',
                 'definition': "Adjustment of the boundaries of a coastal state's territorial sea, exclusive economic zone, or continental shelf"}}}),
             (UUID('396cc35f-de5f-4717-9cb5-2cf511456fb2'),
              {'edges': {'from_node': UUID('d86a7f75-df06-47f1-a30d-67a921d822bf'),
                'to_node': UUID('11903999-15a7-4776-8bae-f1803429147f'),
                'category': 'belongs_to'}}),
             (UUID('88b88486-07be-4ae4-8d51-c2d88ca9f125'),
              {'edges': {'from_node': UUID('1baf2bf2-6083-481f-8980-d2f9793f58e6'),
                'to_node': UUID('11903999-15a7-4776-8bae-f1803429147f'),
                'category': 'involves'}}),
             (UUID('d8b37bae-dbdb-49c8-9e35-6c87c902f3f9'),
              {'nodes': {'semantic_id': 'uninsured_depositor',
                'category': 'stakeholder',
                'attributes': {'deposit_amount': '>100,000 Euro',
                 'treatment': 'subject to a bail-in',
                 'new_role': 'new shareholders of the legacy entity'}}}),
             (UUID('7862ab2f-0f92-48af-ba10-12cdac45a10f'),
              {'nodes': {'semantic_id': 'bank_of_cyprus',
                'category': 'organization',
                'attributes': {'name': 'Bank of Cyprus',
                 'size': 'largest banking group in Cyprus',
                 'relation_to_cyprus_popular_bank': "absorbed the 'good' Cypriot part of Cyprus Popular Bank after it was shuttered"}}}),
             (UUID('1c509f8b-300e-47ac-ad62-4dcf675ca11d'),
              {'nodes': {'semantic_id': 'cyprus_popular_bank',
                'category': 'organization',
                'attributes': {'name': 'Cyprus Popular Bank',
                 'previous_name': 'Marfin Popular Bank',
                 'status': 'shuttered in 2013',
                 'size': 'second largest banking group in Cyprus',
                 'parent': 'Bank of Cyprus'}}}),
             (UUID('82b3a659-7692-4909-a812-fc247f97ed6c'),
              {'edges': {'from_node': UUID('d8b37bae-dbdb-49c8-9e35-6c87c902f3f9'),
                'to_node': UUID('4845f0f8-9a9e-4bf2-b9e6-49ba7ee13b44'),
                'category': 'holds_deposits'}}),
             (UUID('a80831eb-a3c9-4ad6-a6ba-9db107876050'),
              {'edges': {'from_node': UUID('1c509f8b-300e-47ac-ad62-4dcf675ca11d'),
                'to_node': UUID('7862ab2f-0f92-48af-ba10-12cdac45a10f'),
                'category': 'merged_with'}}),
             (UUID('2d393fc3-fe9b-46dd-90f7-2fadb227fccd'),
              {'nodes': {'semantic_id': 'central_bank_of_cyprus',
                'category': 'organization',
                'attributes': {'name': 'Central Bank of Cyprus',
                 'role': 'Governor and Board members amended the lawyers of the legacy entity without consulting the special administrator'}}}),
             (UUID('87a173a3-a32b-4a36-a1b2-6248a92eb14c'),
              {'nodes': {'semantic_id': 'veteran_banker',
                'category': 'stakeholder',
                'attributes': {'name': 'Chris Pavlou',
                 'expertise': 'Treasury'}}}),
             (UUID('7af74b0b-30ab-44c4-8adf-8e19ecf04a14'),
              {'nodes': {'semantic_id': 'special_administrator',
                'category': 'stakeholder',
                'attributes': {'name': 'Andri Antoniadou',
                 'role': 'ran the legacy entity for two years, from March 2013 until 3 March 2015'}}}),
             (UUID('4845f0f8-9a9e-4bf2-b9e6-49ba7ee13b44'),
              {'nodes': {'semantic_id': 'legacy_entity',
                'category': 'organization',
                'attributes': {'description': "the 'bad' part or legacy entity holds all the overseas operations as well as uninsured deposits above 100,000 Euro, old shares and bonds",
                 'ownership_stake': '4.8% of Bank of Cyprus',
                 'board_representation': 'does not hold a board seat',
                 'previous_operations': 'overseas operations of the now defunct Cyprus Popular Bank'}}}),
             (UUID('bfe012be-e584-401f-bd34-f6d147e7831c'),
              {'nodes': {'semantic_id': 'marfin_investment_group',
                'category': 'stakeholder',
                'attributes': {'name': 'Marfin Investment Group',
                 'role': 'former major shareholder of the legacy entity'}}}),
             (UUID('14585bad-087d-4ce7-bbc2-1d89e4cd7548'),
              {'edges': {'from_node': UUID('2d393fc3-fe9b-46dd-90f7-2fadb227fccd'),
                'to_node': UUID('4845f0f8-9a9e-4bf2-b9e6-49ba7ee13b44'),
                'category': 'amended_lawyers_without_consulting'}}),
             (UUID('14eb57de-1b12-42b5-8b91-945cdfd08442'),
              {'edges': {'from_node': UUID('4845f0f8-9a9e-4bf2-b9e6-49ba7ee13b44'),
                'to_node': UUID('7af74b0b-30ab-44c4-8adf-8e19ecf04a14'),
                'category': 'managed_by'}}),
             (UUID('e367aa2a-74bb-426e-b17c-f4ecf2032e6f'),
              {'edges': {'from_node': UUID('87a173a3-a32b-4a36-a1b2-6248a92eb14c'),
                'to_node': UUID('4845f0f8-9a9e-4bf2-b9e6-49ba7ee13b44'),
                'category': 'took_over_as_special_administrator'}}),
             (UUID('1c9504d1-c7e5-4b0b-ad7a-24a03cc9c498'),
              {'edges': {'from_node': UUID('4845f0f8-9a9e-4bf2-b9e6-49ba7ee13b44'),
                'to_node': UUID('bfe012be-e584-401f-bd34-f6d147e7831c'),
                'category': 'pursuing_legal_action_against'}})])

It took ~2 minutes and 43 calls to the LLM to generate the graph_history.

A speed boost can definitely be had in multiple ways:

  • Use a different model. The latest Llama 3 8B model running on Groq infrastructure can yield a 10x speed-up in some cases. The great thing about using a framework like LangChain is the ease with which you can plug and play different models in your pipelines.

  • Increase the chunk size. If each paragraph is passed to the LLM whole, the 43 calls get cut roughly in half in our case.

  • In addition to increasing the chunk size, we can pass multiple paragraphs to the model at once – although this would involve prompting it to emit some paragraph identifier, which we currently get for free simply by attaching each paragraph's idx to the nodes it creates.

  • PARALLELIZE IT (though we may lose some of the history tracking) – see the sketch below.
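To make that last point concrete, here is a minimal sketch of what parallelizing the chunk processing could look like with LangChain's batch API. The names chain and chunks are placeholders for the pipeline we assembled earlier and the list of text splits, and the input key is assumed; because every chunk is processed independently, the running history of nodes and edges can no longer be threaded through the calls, so duplicate entities would need to be merged afterwards.

# A sketch only: `chain` and `chunks` stand in for the pipeline and text splits from earlier.
# Each chunk is handled independently, so the shared node/edge history is not passed
# between calls; deduplication would have to happen after the fact.
results = chain.batch(
    [{"text": chunk} for chunk in chunks],  # one input dict per chunk (key name assumed)
    config={"max_concurrency": 8},          # cap the number of concurrent LLM calls
)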

5.1 Show Me the Money

Or the knowledge graph. Back to rustworkx we go. I made some minor tweaks to the visualization code we saw earlier so that the nodes and edges aren't buried under the massive amount of text we've generated; only the node categories are shown as labels. The graph construction code was also modified to work with the history stored in our graph_history object.

import rustworkx as rx
from rustworkx.visualization import mpl_draw

digraph = rx.PyDiGraph()

node_indices = {}
# Iterate through the history to add nodes and edges
for uuid, data in graph_history.history.items():
    if 'nodes' in data:
        # Add node to the graph and store the index with its UUID
        node_index = digraph.add_node(data['nodes'])
        node_indices[uuid] = node_index

for uuid, data in graph_history.history.items():
    if 'edges' in data:
        # Retrieve indices of the from and to nodes using their UUIDs
        from_index = node_indices.get(data['edges']['from_node'])
        to_index = node_indices.get(data['edges']['to_node'])
        if from_index is not None and to_index is not None:
            # Add edge to the graph
            digraph.add_edge(from_index, to_index, data['edges'])

# Visualize the graph with labels based on node and edge categories
layout = rx.digraph_spring_layout(digraph, repulsive_exponent=50, num_iter=200)
mpl_draw(digraph, with_labels=True, pos=layout,
         labels=lambda node: f'{node["category"]}',
        #  edge_labels=lambda edge: f'{edge["category"]}',
         font_size=9, node_size=50)

Messy, but you get the idea.
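If the stray, unconnected nodes add to the clutter, one option (a sketch only; not what produced the figure above) is to prune isolated nodes before drawing:

# A sketch: drop nodes with no incoming or outgoing edges before visualizing.
# `digraph` is the PyDiGraph built above; this was not applied to the figure shown.
isolated = [n for n in digraph.node_indices()
            if digraph.in_degree(n) + digraph.out_degree(n) == 0]
pruned = digraph.copy()
pruned.remove_nodes_from(isolated)
mpl_draw(pruned, with_labels=True,
         labels=lambda node: f'{node["category"]}',
         font_size=9, node_size=50)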

We can do a little graph analytics to find the most connected nodes. incident_edges(n, all_edges=True) returns the edges incident to the node with index n (both incoming and outgoing), so all we have to do is take the length of that edge list for each node and sort the results.

# Number of incident edges (in + out) for every node in the graph
len_list = [len(digraph.incident_edges(n, all_edges=True)) for n in digraph.node_indices()]
len_list.sort(reverse=True)
len_list[:]
[9,
 5,
 5,
 4,
 4,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

The most connections any single node has is 9, while most nodes have just a single connection and a handful have none at all. Here is said node:

Code
import rustworkx as rx
from rustworkx.visualization import mpl_draw

# 'digraph' is the PyDiGraph we built above
node_index = 35  # Index of the most connected node found above

# Get predecessors and successors
predecessors = list(digraph.predecessor_indices(node_index))
successors = list(digraph.successor_indices(node_index))

# Include the original node and ensure uniqueness of nodes
subgraph_nodes = list(set([node_index] + predecessors + successors))

# Create the subgraph
subgraph = digraph.subgraph(subgraph_nodes)


# Visualize the graph with labels based on node and edge categories
mpl_draw(subgraph, with_labels=True,
         labels=lambda node: f'{node["category"]}\n{node.get("attributes", "")}',
         edge_labels=lambda edge: f'{edge["category"]}',
         font_size=9)
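
Rather than hardcoding the index (35 above), we could find the most connected node programmatically; a minimal sketch reusing the incident-edge counts from before:

# A sketch: locate the node index with the most incident edges instead of hardcoding it.
degree_by_index = {n: len(digraph.incident_edges(n, all_edges=True))
                   for n in digraph.node_indices()}
most_connected = max(degree_by_index, key=degree_by_index.get)
print(most_connected, degree_by_index[most_connected])  # the top node has 9 incident edges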

Do recall that our token window is fairly small, at just 70 tokens per chunk, while we allow up to 600 tokens to store the history of generated nodes and edges that gets fed back to the model. Perhaps this is a good amount of connectivity given those parameters and the fact that only a few of the paragraphs should be related to one another, or perhaps not. The lack of quantifiable best practices in a bleeding-edge field is sad 😢 but expected.

To jog your memory a bit, here is what 70 tokens looks like:

Figure 4: An example text consisting of 70 tokens

6 Wrapping Up

Well…almost!

Now that we have a workflow for generating knowledge graphs for questions in the MuSiQue dataset, we can move on to attaching a vector database to it in the next post.

Thanks for reading, I hope you managed to stay awake.

Part Three >>>