Research - Open Source
Hierarchical Artifact Decomposition-Recomposition
Requirement traceability, validation, and verification grow increasingly complex as engineering projects scale in size and interdependence. Technical specifications carry valuable structural information in natural-language form, and advances in Large Language Models (LLMs), with their effectiveness in natural language processing and reasoning, make specification corpora amenable to information extraction and relational reasoning. While traditional requirements and test engineering methods rely on human expertise, LLMs have shown comparable performance in retrieval-augmented reasoning tasks.
We propose Hierarchical Artifact Decomposition-Recomposition (HADR), a knowledge graph construction method for reasoning-based requirements tracing that decomposes artifacts into hierarchical chunk trees via branch and leaf operations, recomposing context through upward summarization while preserving all artifact content. By explicitly decoupling entity extraction from relation inference, HADR surfaces implicit relational information prior to graph construction, enabling context-aware retrieval and reasoning across linked artifacts. Integrated into a RAG pipeline, the resulting knowledge graph supports scalable, multi-artifact requirements traceability by enabling ranked traceability linkages, knowledge graph search, and summarized reasoning across requirements and test cases.
HADR operates as a linear pipeline: an artifact enters, is decomposed into a depth-two tree, recomposed via summarization, stored in relational and vector databases, then enriched with entities and relationships that are upserted into a knowledge graph.
An artifact \(A\) is first tokenized with o200k_base (tiktoken) then sliced into
overlapping branch chunks of fixed window \(K\) and overlap \(P\). Each branch
chunk is then passed to an LLM with the leaf.j2 prompt, which semantically
partitions it into non-overlapping leaf chunks whose union exactly
reconstructs the branch. The result is a depth-two tree:
root → branches → leaves.
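The branch-slicing step can be sketched as follows; `branch_chunks` is a hypothetical helper (not part of the released code) that operates on a list of token IDs as produced by the o200k_base encoding, with window `K` and overlap `P` as above.

```python
def branch_chunks(tokens: list[int], K: int, P: int) -> list[list[int]]:
    """Slice a token sequence into overlapping windows of size K,
    where consecutive windows share P tokens."""
    if K <= P:
        raise ValueError("window K must exceed overlap P")
    step = K - P
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + K])
        if start + K >= len(tokens):
            break  # last window already reaches the end of the artifact
    return chunks
```

Each window is then decoded back to text and handed to the leaf.j2 prompt for semantic partitioning.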
leaf.j2 prompt
{#---
description: A template for semantic text chunking tasks.
author: Pablo B., Marlon S.
---#}
{% set system %}
You are an expert linguist tool that semantically separates text.
Your task is to split the provided text into semantically meaningful
segments without altering the original content. Follow these
instructions carefully:
---Input---
- Preserve the original text exactly: wording, spelling, punctuation,
whitespace, line breaks, and formatting must remain unchanged
within each chunk.
- The text may include markdown, code blocks (including triple
backticks), tables, LaTeX, visual-to-text transcription, or any
other formats.
- If the input text contains the sequence <|>, replace every
occurrence with [PIPE] before chunking to avoid confusion with
the output delimiters.
---Task---
- Carefully break up the text into logical and semantically meaningful
parts. Each chunk should represent a single, coherent idea or unit
of meaning (such as a sentence, paragraph, or code block).
- Treat code blocks, tables, or other structured elements as atomic
units where appropriate.
- Do not add any commentary, explanation, or preamble before or after
the output.
- Do not add or remove any formatting characters such as triple
backticks or newline characters inside the chunks.
- Return only the chunks concatenated together, each enclosed by the
delimiters <|> with no spaces or newlines between chunks.
---Output Format---
<|>ChunkOne<|><|>ChunkTwo<|><|>LastChunk<|>
{% endset %}
{% set user %}
Semantically segment the following text into distinct meaning units,
where each chunk represents a single coherent idea or topic.
---Input Text---
{{ text }}
{% endset %}
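Because the leaves must be non-overlapping and their union must exactly reconstruct the branch, the model's `<|>`-delimited reply can be parsed and checked mechanically. A minimal sketch, with hypothetical helpers `parse_leaves` and `verify_reconstruction` (not part of the released code):

```python
def parse_leaves(raw: str) -> list[str]:
    """Split the model's reply on the <|> delimiter, dropping the
    empty fragments produced between adjacent delimiters."""
    return [part for part in raw.split("<|>") if part]

def verify_reconstruction(branch: str, leaves: list[str]) -> bool:
    """Leaves concatenated in order must reproduce the branch verbatim."""
    return "".join(leaves) == branch
```

A failed check signals that the model altered content during chunking, in which case the branch can be resubmitted.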
Summaries propagate upward: the leaf chunks within each branch subtree are condensed into a
branch-level summary via the summary.j2 prompt. Branch summaries are then
aggregated into a single artifact-level summary. HADR is information-preserving: all original chunks remain alongside their summaries for downstream use.
summary.j2 prompt
{#---
description: A template for text summarization tasks.
author: Pablo B., Marlon S.
---#}
{% set system %}
You are an expert summary tool tasked with writing a concise and
comprehensive summary of the input text. Follow these instructions
carefully:
---Input---
- The text may include markdown, code blocks (including triple
backticks and newline characters), tables, LaTeX, visual-to-text
transcription, or any other formats. When summarizing, convert
these elements into clear, descriptive writing rather than
including raw formatting or code, unless the code or formatting
is critical to understanding the content.
---Task---
- Summarize the input text by including all key details and
important points without adding any information not present in
the original text.
- Do not add any commentary, explanation, or preamble before or
after the output.
- Keep the summary concise, ideally within 3-5 sentences.
{% endset %}
{% set user %}
Write a summary of the following text, including as many key
details as possible.
---Input Text---
{{ text }}
{% endset %}
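The upward recomposition can be sketched as a two-level fold; `recompose` and its `summarize` callback are illustrative stand-ins for the summary.j2 LLM call, not the project's actual API:

```python
from typing import Callable

def recompose(tree: dict[str, list[str]],
              summarize: Callable[[str], str]) -> tuple[dict[str, str], str]:
    """Condense each branch's leaves into a branch-level summary,
    then aggregate the branch summaries into one artifact summary.
    `summarize` stands in for an LLM call using summary.j2."""
    branch_summaries = {
        branch_id: summarize("\n".join(leaves))
        for branch_id, leaves in tree.items()
    }
    artifact_summary = summarize("\n".join(branch_summaries.values()))
    return branch_summaries, artifact_summary
```

The original leaves are never discarded, so the summaries are purely additive context.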
All artifacts, branches, leaves, embeddings, and summaries are streamed into a
SQLite database (master.db) using ijson for
constant-memory ingestion. Embeddings are simultaneously upserted into
Qdrant (cosine similarity, 3,072-dim vectors from
gemini-embedding-001) to enable semantic nearest-neighbor search across
branches and leaves. The UUID hierarchy encodes the tree structure:
041f16b3-…-d300
041f16b3-…-d300.1
041f16b3-…-d300.1.3
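The dotted-suffix scheme makes tree membership recoverable from the identifier alone. A minimal sketch with hypothetical helpers (`child_id`, `ancestry`), using a placeholder root in place of a real UUID:

```python
def child_id(parent: str, index: int) -> str:
    """Append a 1-based child index to a dotted hierarchy path."""
    return f"{parent}.{index}"

def ancestry(node_id: str) -> list[str]:
    """All ancestors of a node, root first, derived purely from the ID."""
    root, *indices = node_id.split(".")
    out, current = [], root
    for idx in indices:
        out.append(current)
        current = f"{current}.{idx}"
    return out
```

A leaf ID therefore resolves its branch and artifact without any join against the database.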
CREATE TABLE IF NOT EXISTS artifacts (
    uuid    TEXT PRIMARY KEY,
    path    TEXT NOT NULL DEFAULT '',
    type    TEXT NOT NULL DEFAULT '',
    summary TEXT
);

CREATE TABLE IF NOT EXISTS branches (
    uuid          TEXT PRIMARY KEY,
    artifact_uuid TEXT NOT NULL REFERENCES artifacts(uuid),
    chunk         TEXT NOT NULL DEFAULT '',
    path          TEXT NOT NULL DEFAULT '',
    summary       TEXT
);

CREATE TABLE IF NOT EXISTS branch_embeddings (
    branch_uuid TEXT PRIMARY KEY REFERENCES branches(uuid),
    embedding   TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS leafs (
    uuid        TEXT PRIMARY KEY,
    branch_uuid TEXT NOT NULL REFERENCES branches(uuid),
    text        TEXT NOT NULL DEFAULT '',
    entities    TEXT
);

CREATE TABLE IF NOT EXISTS leaf_embeddings (
    leaf_uuid TEXT PRIMARY KEY REFERENCES leafs(uuid),
    embedding TEXT NOT NULL
);
qdrant.py
from qdrant_client import QdrantClient
from tuutrag.utils import log_timestamp
from pathlib import Path

IMAGE = "qdrant/qdrant:latest"


class VectorDB:
    def __init__(self, port: int, host: str):
        self.client = self.connect(port, host)

    def connect(self, port: int, host: str):
        port = int(port)
        host = str(host)
        storage = str(Path(__file__).parent / "data/qdrant")
        try:
            client = QdrantClient(
                port=port, host=host, timeout=50, https=False
            )
            log_timestamp(f"Connected to http://{host}:{port}/dashboard")
            return client
        except Exception as e:
            raise RuntimeError(f"Error connecting to Qdrant: {e}")

    def create_collection(self, collection_name, vector_params):
        if self.client.collection_exists(collection_name=collection_name):
            return
        self.client.create_collection(
            collection_name=collection_name,
            vectors_config=vector_params,
        )

    def upsert(self, collection_name: str, point):
        self.client.upsert(
            collection_name=collection_name,
            points=point,
            wait=True,
        )
HADR explicitly decouples entity extraction from relation inference and applies relation discovery at three progressively wider scopes:
Entities are extracted at the leaf level via the entity.j2
prompt; each leaf is processed independently, and the results are written back as a JSON array.
Local relations are inferred per-branch from the union of that branch's leaf
entities using relation_local.j2, capturing relationships within a single branch.
Global relations widen the scope to an entire artifact: embedding-similarity
search across branches within the same document tree identifies semantically related chunks,
and cross-set relation inference is performed using relation_global_uni.j2.
Finally, universal relations span the full corpus: the same similarity search
and relation_global_uni.j2 prompt are applied across branches from
all ingested artifacts, surfacing cross-document dependencies that would otherwise
remain implicit.
Per-leaf · entity.j2
Within branch · relation_local.j2
Within artifact · relation_global_uni.j2
Across all artifacts · relation_global_uni.j2
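The global and universal scopes differ only in which branch pairs the similarity search is allowed to match. This can be sketched as a filter over pairs; `candidate_pairs` and its cosine routine are illustrative stand-ins (one embedding per branch, a hypothetical similarity threshold), whereas the actual pipeline delegates the search to Qdrant:

```python
import math
from itertools import combinations

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def candidate_pairs(branches: dict[str, list[float]],
                    artifact_of: dict[str, str],
                    threshold: float,
                    cross_artifact: bool) -> list[tuple[str, str]]:
    """Branch pairs similar enough to feed relation_global_uni.j2.
    cross_artifact=False -> global scope (same document tree only);
    cross_artifact=True  -> universal scope (cross-document only)."""
    pairs = []
    for u, v in combinations(branches, 2):
        same_doc = artifact_of[u] == artifact_of[v]
        if cross_artifact == same_doc:
            continue  # pair belongs to the other scope's pass
        if cosine(branches[u], branches[v]) >= threshold:
            pairs.append((u, v))
    return pairs
```

Each surviving pair is then packaged as numbered sets (entities, relations, raw chunk, artifact summary) for the cross-set relation prompt.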
entity.j2 prompt
{#---
description: A template for named entity extraction tasks.
author: Pablo B., Marlon S.
---#}
{% set system %}
You are an entity extraction tool. Extract entities only from the
section labeled "Input Text". Follow these instructions carefully:
---Input---
- The entire input text represents a meaningful idea (such as a
sentence, paragraph, or code block).
- If the entity name is case-insensitive, capitalize the first
letter of each significant word (title case). Ensure consistent
naming across the entire extraction process.
- If the input text contains the sequence <|>, replace it
with [PIPE] to avoid confusion with the entity delimiter.
---Task---
- Identify all meaningful entities only from the input text.
- Do not add any commentary, explanation, or preamble.
- Return only the identified entities in the following format,
with each entity enclosed by <|> delimiters.
---Output Format---
<|>EntityOne<|><|>EntityTwo<|><|>EntityThree<|>
---Example---
Input: "On March 14, 2023, Dr. Evelyn Hart of the Johns Hopkins
Neurology Department filed Case No. 2023-MED-0047 with the FDA
citing adverse reactions to Compound XR-9B, developed under NIH
Grant #R01NS112233, affecting patients in the Phase III TriNeuro
Trial across facilities in Baltimore, MD and Denver, CO."
Output: <|>March 14, 2023<|><|>Dr. Evelyn Hart<|>
<|>Johns Hopkins Neurology Department<|><|>Johns Hopkins<|>
<|>Case No. 2023-MED-0047<|><|>FDA<|><|>Compound XR-9B<|>
<|>NIH Grant #R01NS112233<|><|>NIH<|>
<|>Phase III TriNeuro Trial<|><|>Baltimore<|><|>MD<|>
<|>Denver<|><|>CO<|>
{% endset %}
{% set user %}
Extract all named entities from the following text.
---Input Text---
{{ text }}
{% endset %}
relation_local.j2 prompt
{#---
description: A template for local relationship identification
tasks between entities.
author: Pablo B., Marlon S.
---#}
{% set system %}
You are a relationship identification tool. Identify relationships
only from the section labeled "Entities". "Raw Chunk" may be used
for disambiguation only. Follow these instructions carefully:
---Input---
- Entities are provided as a list, where each list item represents
a single coherent idea.
- If the input text contains the sequence <|>, replace it
with [PIPE] to avoid confusion with the entity delimiter.
---Task---
- Identify meaningful relationships between entities based on
the provided information. Only extract relationships that are
explicitly stated or directly implied.
- If a single statement describes a relationship involving more
than two entities (an N-ary relationship), decompose it into
multiple binary (two-entity) pairs.
E.g. "REQ-12 and REQ-13 both derive from NEED-04" yields:
"REQ-12, derives-from, NEED-04" and
"REQ-13, derives-from, NEED-04."
- Relationships are directional: source → target.
- Do not add any commentary or preamble.
---Output Format---
<|>Source, relationship-type, Target<|>
{% endset %}
{% set user %}
Identify all relationships that are explicitly stated or directly
implied across the "Entities".
---Raw Chunk---
{{ raw_chunk }}
---Entities---
{{ entities }}
{% endset %}
relation_global_uni.j2 prompt
{#---
description: A template for global relationship identification
tasks between entities across multiple sets.
This template is identical to relation_global.j2.
Both map to the same underlying prompt.
author: Pablo B., Marlon S.
---#}
{% set system %}
You are a relationship identification tool. Identify relationships
only from the sections labeled "Entities". Each "Relations",
"Raw Chunk" and "Artifact Summary" may be used for disambiguation
of their respective set only. Follow these instructions carefully:
---Input---
Input is provided across K numbered sets, each formatted as:
[SET {n}]
Entities: <comma-separated list>
Relations: <"<|>"-comma separated list>
Raw Chunk: <source text>
Artifact Summary: <artifact-level summary>
- Treat each entity as scoped to its own set.
- If the input text contains the sequence <|>, replace it
with [PIPE].
---Task---
- Identify meaningful relationships between entities across
different sets. Do not extract relationships within the same set.
- Only extract relationships explicitly stated or directly implied.
- Decompose N-ary relationships into binary pairs.
- Relationships are directional: source → target.
---Output Format---
<|>Source, relationship-type, Target<|>
---Example---
Input:
[SET 1]
Entities: Dr. Evelyn Hart, Johns Hopkins, Case No. 2023-MED-0047, FDA
Relations: <|>Dr. Evelyn Hart, works at, Johns Hopkins<|>
<|>Case No. 2023-MED-0047, reviewed by, FDA<|>
Raw Chunk: "…filed Case No. 2023-MED-0047 with the FDA citing
adverse reactions to Compound XR-9B."
Artifact Summary: "FDA case filing regarding drug adverse events."
[SET 2]
Entities: Compound XR-9B, NIH, Phase III TriNeuro Trial
Relations: <|>Compound XR-9B, funded by, NIH Grant #R01NS112233<|>
<|>Compound XR-9B, tested in, Phase III TriNeuro Trial<|>
Raw Chunk: "Compound XR-9B…was the subject of an adverse event report…
The Phase III TriNeuro Trial was subsequently suspended."
Artifact Summary: "NIH-funded clinical trial documentation."
Output:
<|>Case No. 2023-MED-0047, concerns, Compound XR-9B<|>
<|>Dr. Evelyn Hart, reported, Compound XR-9B<|>
<|>FDA, suspended, Phase III TriNeuro Trial<|>
{% endset %}
{% set user %}
Identify all relationships that are explicitly stated or directly
implied across the "Entities".
---Raw Chunk---
{{ raw_chunk }}
---Entities---
{{ entities }}
---Relations---
{{ relations }}
---Artifact Summary---
{{ artifact_summary }}
{% endset %}
All extracted triples (local, global, and universal) are merged and upserted into
Memgraph using the Bolt protocol. Each entity becomes an :Entity
node and each relationship a directed :RELATIONSHIP edge with a type
property. The MemgraphConnection class parses <|>-delimited
triples from JSONL and executes idempotent MERGE Cypher queries.
memgraph.py
import json
from typing import List

from exports import export
from neo4j import GraphDatabase


@export
class MemgraphConnection:
    obj_name = "MemgraphConnection"
    workspace = "kg"

    def __init__(self, port: int, frontend_port: int, host: str):
        self.driver = self.__connect(port, frontend_port, host)

    def __connect(self, port, frontend_port, host):
        URI = f"bolt://{host}:{int(port)}"
        driver = GraphDatabase.driver(URI, auth=("", ""))
        driver.verify_connectivity()
        print(f"Memgraph UI → http://{host}:{frontend_port}/")
        print(f"Memgraph Bolt → {URI}")
        return driver

    def read_data(self, file_path: str) -> List:
        parts = []
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                data = json.loads(line)
                content = data["text"].split("<|>")[1]
                part = [i.strip() for i in content.split(",")]
                if len(part) == 3:
                    parts.append(part)
        return parts

    def upsert(self, data: List) -> None:
        with self.driver.session() as session:
            for i, part in enumerate(data):
                session.run("""
                    MERGE (source:Entity {name: $source})
                    MERGE (target:Entity {name: $target})
                    MERGE (source)-[r:RELATIONSHIP]->(target)
                    SET r.type = $relationship
                    """, source=part[0],
                    relationship=part[1],
                    target=part[2])
                print(f"upserted {i+1}/{len(data)}: "
                      f"{part[0]} -[{part[1]}]-> {part[2]}")
The codebase is organized into a compact Python package (tuutrag/) and a set
of Jinja2 prompt templates. Below are the core building blocks that power every pipeline stage.
BranchChunk, BatchRequest, Message, RequestBody — typed dicts enforcing the data contract across stages.
DataManager — resolves file paths across raw/, interim/, processed/, and api/ directories. Singleton instance D.
create_batch_request() — assembles OpenAI Batch API payloads with system + user messages from rendered templates.
load_prompt() — renders Jinja2 templates via make_module() and returns {"system": …, "user": …} string pairs.
count_batch_request_tokens() — tiktoken-based token counting for batch-size splitting.
VectorDB — wraps QdrantClient with connect(), create_collection(), and upsert().
types.py
from typing import Literal, TypedDict


class BranchChunk(TypedDict):
    uuid: str
    chunk: str
    path: str
    type: str


class TextContent(TypedDict):
    type: Literal["text"]
    text: str


class Message(TypedDict):
    role: Literal["system", "user"]
    content: str | list[TextContent]


class RequestBody(TypedDict):
    model: str
    messages: list[Message]
    stream: Literal[False]


class BatchRequest(TypedDict):
    custom_id: str
    method: Literal["POST"]
    url: Literal["/v1/chat/completions"]
    body: RequestBody
prompt.py + manager.py
from tuutrag.types import BatchRequest


def create_batch_request(
    custom_id: str,
    model: str,
    **kwargs
) -> BatchRequest:
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": kwargs["system"]},
                {"role": "user", "content": kwargs["user"]},
            ],
            "stream": False,
        },
    }


from pathlib import Path
from jinja2 import Environment, FileSystemLoader, meta

TEMPLATES_DIR = Path(__file__).parent / "templates"


def load_prompt(template_name: str, **kwargs) -> dict[str, str]:
    env = Environment(
        loader=FileSystemLoader(str(TEMPLATES_DIR)),
        keep_trailing_newline=True,
        comment_start_string="{#---",
        comment_end_string="---#}",
    )
    template = env.get_template(template_name)
    module = template.make_module(vars=kwargs)
    return {
        "system": str(module.system).strip(),
        "user": str(module.user).strip(),
    }
utils.py
from tiktoken.core import Encoding
from tuutrag.types import BatchRequest


def count_batch_request_tokens(
    enc: Encoding, payload: BatchRequest
) -> int:
    total = 0
    for message in payload["body"]["messages"]:
        content = message["content"]
        if isinstance(content, str):
            total += len(enc.encode(content))
        elif isinstance(content, list):
            for block in content:
                if block.get("type") == "text":
                    total += len(enc.encode(block["text"]))
    return total
Every LLM inference in the pipeline — leaf chunking, summarization, entity extraction,
and relationship extraction — is submitted through the OpenAI Batch API.
Requests are assembled via create_batch_request(), token-counted with
count_batch_request_tokens(), and split into JSONL files respecting per-batch
token limits. This approach provides 50% cost savings over synchronous completions while
enabling parallelized processing of hundreds of thousands of chunks. Embeddings are
handled separately through Gemini (gemini-embedding-001),
batched with sliding-window RPM/TPM rate limiting.
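The token-budgeted splitting can be sketched as a greedy packer; `split_batches` and `write_jsonl` below are hypothetical helpers illustrating the idea, not the package's actual functions:

```python
import json

def split_batches(payloads: list[dict],
                  token_counts: list[int],
                  max_tokens: int) -> list[list[dict]]:
    """Greedily pack batch requests into groups that respect a
    per-batch token limit; each group becomes one JSONL file.
    An oversized single request still gets its own batch."""
    batches, current, current_tokens = [], [], 0
    for payload, n in zip(payloads, token_counts):
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(payload)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

def write_jsonl(path: str, batch: list[dict]) -> None:
    """Serialize one batch as newline-delimited JSON for upload."""
    with open(path, "w", encoding="utf-8") as f:
        for payload in batch:
            f.write(json.dumps(payload) + "\n")
```

In the real pipeline the per-request counts come from count_batch_request_tokens() over payloads built by create_batch_request().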
The pipeline is orchestrated across six Jupyter notebooks, each handling a
distinct phase. They share the tuutrag/ module layer for prompts,
types, and data management.