
HADR

Hierarchical Artifact Decomposition-Recomposition

¹JPL / Caltech · ²Dominican University · ³UIUC · ⁴J Sterling Morton East HS

A Knowledge Graph for Requirements Tracing

Requirements traceability, validation, and verification grow increasingly complex as engineering projects scale in size and interdependence. Technical specifications contain valuable structural information in natural-language form. Advances in Large Language Models (LLMs) and their effectiveness in natural language processing and reasoning make specification corpora amenable to information extraction and relational reasoning. While traditional requirements and test engineering methods rely on human expertise, LLMs have shown comparable performance in retrieval-augmented reasoning tasks.

We propose Hierarchical Artifact Decomposition-Recomposition (HADR), a knowledge graph construction method for reasoning-based requirements tracing that decomposes artifacts into hierarchical chunk trees via branch and leaf operations, recomposing context through upward summarization while preserving all artifact content. By explicitly decoupling entity extraction from relation inference, HADR surfaces implicit relational information prior to graph construction, enabling context-aware retrieval and reasoning across linked artifacts. Integrated into a RAG pipeline, the resulting knowledge graph supports scalable, multi-artifact requirements traceability by enabling ranked traceability linkages, knowledge graph search, and summarized reasoning across requirements and test cases.

Architecture at a Glance

HADR operates as a linear pipeline. An artifact enters, is decomposed into a depth-two tree, recomposed via summarization, stored in a relational + vector database, then enriched with entities and relationships that are upserted into a knowledge graph.

Artifact: PDF → OCR → Text (multimodal)
Branch: fixed window \(K\), overlap \(P\)
Leaf: LLM semantic split
Recompose: Leaf → Branch → Root
Store: SQLite + Qdrant
KG Build: Entities → Relations

Pipeline Scale

453 Artifacts
37,239 Branch Chunks
322,120 Leaf Chunks
18.9M Tokens (o200k)
3,072 Embedding Dim
~15 GB Database Size

Decomposition

An artifact \(A\) is first tokenized with o200k_base (tiktoken) then sliced into overlapping branch chunks of fixed window \(K\) and overlap \(P\). Each branch chunk is then passed to an LLM with the leaf.j2 prompt, which semantically partitions it into non-overlapping leaf chunks whose union exactly reconstructs the branch. The result is a depth-two tree: root → branches → leaves.

[Figure: depth-two chunk tree. Root artifact \(A\) → branch chunks \(C_{b_1}, C_{b_2}, \dots, C_{b_N}\) (fixed-size, overlapping) → leaf chunks \(C_{l_1}, C_{l_2}, \dots, C_{l_M}\) (variable-size, non-overlapping, LLM-inferred).]
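The fixed-window slicing over the token sequence can be sketched as follows. This is a minimal illustration: the helper name `branch_windows` is hypothetical, and \(K\) and \(P\) are pipeline parameters, not fixed defaults.

```python
def branch_windows(tokens: list[int], K: int, P: int) -> list[list[int]]:
    """Slice a token sequence into fixed-size windows of K tokens with
    P tokens of overlap between consecutive windows."""
    if not 0 <= P < K:
        raise ValueError("require 0 <= P < K")
    step = K - P
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + K])
        if start + K >= len(tokens):  # final window reached the end of the sequence
            break
    return windows
```

Each window corresponds to one branch chunk \(C_{b_i}\); decoding a window with the o200k_base encoder yields the branch text handed to the leaf-splitting prompt.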
tuutrag/prompts/templates/leaf.j2 Jinja2
{#---
description: A template for semantic text chunking tasks.
author: Pablo B., Marlon S.
---#}

{% set system %}
You are an expert linguist tool that semantically separates text.
Your task is to split the provided text into semantically meaningful
segments without altering the original content. Follow these
instructions carefully:

---Input---
- Preserve the original text exactly: wording, spelling, punctuation,
  whitespace, line breaks, and formatting must remain unchanged
  within each chunk.
- The text may include markdown, code blocks (including triple
  backticks), tables, LaTeX, visual-to-text transcription, or any
  other formats.
- If the input text contains the sequence <|>, replace every
  occurrence with [PIPE] before chunking to avoid confusion with
  the output delimiters.

---Task---
- Carefully break up the text into logical and semantically meaningful
  parts. Each chunk should represent a single, coherent idea or unit
  of meaning (such as a sentence, paragraph, or code block).
- Treat code blocks, tables, or other structured elements as atomic
  units where appropriate.
- Do not add any commentary, explanation, or preamble before or after
  the output.
- Do not add or remove any formatting characters such as triple
  backticks or newline characters inside the chunks.
- Return only the chunks concatenated together, each enclosed by the
  delimiters <|> with no spaces or newlines between chunks.

---Output Format---
<|>ChunkOne<|><|>ChunkTwo<|><|>LastChunk<|>
{% endset %}

{% set user %}
Semantically segment the following text into distinct meaning units,
where each chunk represents a single coherent idea or topic.

---Input Text---
{{ text }}
{% endset %}
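Downstream, the model's <|>-delimited response has to be parsed back into leaf chunks and checked against the exact-reconstruction invariant. A minimal sketch, with a hypothetical helper name; a real implementation would also undo the [PIPE] substitution described in the prompt:

```python
def parse_leaf_chunks(raw: str, branch_text: str) -> list[str]:
    """Split a '<|>Chunk<|>' response into leaf chunks and verify that
    their concatenation reproduces the branch chunk verbatim."""
    leaves = [part for part in raw.split("<|>") if part]
    if "".join(leaves) != branch_text:
        raise ValueError("leaf chunks do not reconstruct the branch")
    return leaves
```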

Recomposition

Summaries propagate upward: the leaf chunks within each branch subtree are condensed into a branch-level summary via the summary.j2 prompt, and branch summaries are then aggregated into a single artifact-level summary. HADR is information-preserving: all original chunks remain alongside their summaries for downstream use.

[Figure: recomposition. Leaf chunks \(C_{l_1}, C_{l_2}, \dots, C_{l_M}\) → branch summary \(\hat{B}_i\) (repeated for each of the \(N\) branches) → artifact summary \(\hat{A}\).]
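The two-level roll-up can be sketched as follows, where `summarize` stands in for an LLM call rendered from summary.j2; the function name and shape are illustrative, not the pipeline's actual API:

```python
from typing import Callable

def recompose(branch_leaves: list[list[str]],
              summarize: Callable[[str], str]) -> tuple[list[str], str]:
    """Bottom-up recomposition: condense each branch's leaves into a branch
    summary, then condense the branch summaries into one artifact summary."""
    branch_summaries = [summarize("\n".join(leaves)) for leaves in branch_leaves]
    artifact_summary = summarize("\n".join(branch_summaries))
    return branch_summaries, artifact_summary
```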
tuutrag/prompts/templates/summary.j2 Jinja2
{#---
description: A template for text summarization tasks.
author: Pablo B., Marlon S.
---#}

{% set system %}
You are an expert summary tool tasked with writing a concise and
comprehensive summary of the input text. Follow these instructions
carefully:

---Input---
- The text may include markdown, code blocks (including triple
  backticks and newline characters), tables, LaTeX, visual-to-text
  transcription, or any other formats. When summarizing, convert
  these elements into clear, descriptive writing rather than
  including raw formatting or code, unless the code or formatting
  is critical to understanding the content.

---Task---
- Summarize the input text by including all key details and
  important points without adding any information not present in
  the original text.
- Do not add any commentary, explanation, or preamble before or
  after the output.
- Keep the summary concise, ideally within 3-5 sentences.
{% endset %}

{% set user %}
Write a summary of the following text, including as many key
details as possible.

---Input Text---
{{ text }}
{% endset %}

Storage Layer

All artifacts, branches, leaves, embeddings, and summaries are streamed into a SQLite database (master.db) using ijson for constant-memory ingestion. Embeddings are simultaneously upserted into Qdrant (cosine similarity, 3,072-dim vectors from gemini-embedding-001) to enable semantic nearest-neighbor search across branches and leaves. The UUID hierarchy encodes the tree structure:

Artifact 041f16b3-…-d300
Branch 041f16b3-…-d300.1
Leaf 041f16b3-…-d300.1.3
notebooks/database.ipynb SQL
CREATE TABLE IF NOT EXISTS artifacts (
    uuid    TEXT PRIMARY KEY,
    path    TEXT NOT NULL DEFAULT '',
    type    TEXT NOT NULL DEFAULT '',
    summary TEXT
);

CREATE TABLE IF NOT EXISTS branches (
    uuid          TEXT PRIMARY KEY,
    artifact_uuid TEXT NOT NULL REFERENCES artifacts(uuid),
    chunk         TEXT NOT NULL DEFAULT '',
    path          TEXT NOT NULL DEFAULT '',
    summary       TEXT
);

CREATE TABLE IF NOT EXISTS branch_embeddings (
    branch_uuid TEXT PRIMARY KEY REFERENCES branches(uuid),
    embedding   TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS leafs (
    uuid        TEXT PRIMARY KEY,
    branch_uuid TEXT NOT NULL REFERENCES branches(uuid),
    text        TEXT NOT NULL DEFAULT '',
    entities    TEXT
);

CREATE TABLE IF NOT EXISTS leaf_embeddings (
    leaf_uuid TEXT PRIMARY KEY REFERENCES leafs(uuid),
    embedding TEXT NOT NULL
);
tuutrag/qdrant.py Python
from qdrant_client import QdrantClient
from tuutrag.utils import log_timestamp
from pathlib import Path

IMAGE = "qdrant/qdrant:latest"


class VectorDB:
    def __init__(self, port: int, host: str):
        self.client = self.connect(port, host)

    def connect(self, port: int, host: str):
        port = int(port)
        host = str(host)
        storage = str(Path(__file__).parent / "data/qdrant")
        try:
            client = QdrantClient(
                port=port, host=host, timeout=50, https=False
            )
            log_timestamp(f"Connected to http://{host}:{port}/dashboard")
            return client
        except Exception as e:
            raise RuntimeError(f"Error connecting to Qdrant: {e}")

    def create_collection(self, collection_name, vector_params):
        if self.client.collection_exists(
            collection_name=collection_name
        ):
            return
        self.client.create_collection(
            collection_name=collection_name,
            vectors_config=vector_params,
        )

    def upsert(self, collection_name: str, point):
        self.client.upsert(
            collection_name=collection_name,
            points=point,
            wait=True,
        )

Entity & Relation Extraction

HADR explicitly decouples entity extraction from relation inference and applies relation discovery at three progressively wider scopes:

Entities are extracted at the leaf level via the entity.j2 prompt, with each leaf processed independently and the results written back as a JSON array. Local relations are inferred per-branch from the union of that branch's leaf entities using relation_local.j2, capturing relationships within a single branch. Global relations widen the scope to an entire artifact: embedding-similarity search across branches within the same document tree identifies semantically related chunks, and cross-set relation inference is performed using relation_global_uni.j2. Finally, universal relations span the full corpus: the same similarity search and relation_global_uni.j2 prompt are applied across branches from all ingested artifacts, surfacing cross-document dependencies that would otherwise remain implicit.

1. Entity Extraction (per-leaf · entity.j2): Dr. Evelyn Hart, Johns Hopkins, FDA, Case No. 2023-MED-0047, Compound XR-9B
2. Local Relations (within branch · relation_local.j2): Dr. Evelyn Hart → works at → Johns Hopkins; Case 2023-MED-0047 → reviewed by → FDA
3. Global Relations (within artifact · relation_global_uni.j2): Case 2023-MED-0047 → concerns → Compound XR-9B; FDA → suspended → Phase III TriNeuro Trial
4. Universal Relations (across all artifacts · relation_global_uni.j2): Compound XR-9B → referenced in → REQ-042; Dr. Evelyn Hart → authored → TC-108
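The global and universal passes rely on embedding-similarity search to choose which branch chunks to compare; in the pipeline this is Qdrant's cosine search, but the selection logic amounts to the following pure-Python sketch (function names are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_branches(query: list[float],
                   branch_vecs: list[list[float]], k: int) -> list[int]:
    """Indices of the k branch embeddings most similar to the query."""
    ranked = sorted(range(len(branch_vecs)),
                    key=lambda i: cosine(query, branch_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The entity sets of the returned branches are then paired for cross-set relation inference.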
tuutrag/prompts/templates/entity.j2 Jinja2
{#---
description: A template for named entity extraction tasks.
author: Pablo B., Marlon S.
---#}

{% set system %}
You are an entity extraction tool. Extract entities only from the
section labeled "Input Text". Follow these instructions carefully:

---Input---
- The entire input text represents a meaningful idea (such as a
  sentence, paragraph, or code block).
- If the entity name is case-insensitive, capitalize the first
  letter of each significant word (title case). Ensure consistent
  naming across the entire extraction process.
- If the input text contains the sequence <|>, replace it
  with [PIPE] to avoid confusion with the entity delimiter.

---Task---
- Identify all meaningful entities only from the input text.
- Do not add any commentary, explanation, or preamble.
- Return only the identified entities in the following format,
  with each entity enclosed by <|> delimiters.

---Output Format---
<|>EntityOne<|><|>EntityTwo<|><|>EntityThree<|>

---Example---
Input: "On March 14, 2023, Dr. Evelyn Hart of the Johns Hopkins
Neurology Department filed Case No. 2023-MED-0047 with the FDA
citing adverse reactions to Compound XR-9B, developed under NIH
Grant #R01NS112233, affecting patients in the Phase III TriNeuro
Trial across facilities in Baltimore, MD and Denver, CO."
Output: <|>March 14, 2023<|><|>Dr. Evelyn Hart<|>
<|>Johns Hopkins Neurology Department<|><|>Johns Hopkins<|>
<|>Case No. 2023-MED-0047<|><|>FDA<|><|>Compound XR-9B<|>
<|>NIH Grant #R01NS112233<|><|>NIH<|>
<|>Phase III TriNeuro Trial<|><|>Baltimore<|><|>MD<|>
<|>Denver<|><|>CO<|>
{% endset %}

{% set user %}
Extract all named entities from the following text.

---Input Text---
{{ text }}
{% endset %}
tuutrag/prompts/templates/relation_local.j2 Jinja2
{#---
description: A template for local relationship identification
             tasks between entities.
author: Pablo B., Marlon S.
---#}

{% set system %}
You are a relationship identification tool. Identify relationships
only from the section labeled "Entities". "Raw Chunk" may be used
for disambiguation only. Follow these instructions carefully:

---Input---
- Entities are provided as a list, where each list item represents
  a single coherent idea.
- If the input text contains the sequence <|>, replace it
  with [PIPE] to avoid confusion with the entity delimiter.

---Task---
- Identify meaningful relationships between entities based on
  the provided information. Only extract relationships that are
  explicitly stated or directly implied.
- If a single statement describes a relationship involving more
  than two entities (an N-ary relationship), decompose it into
  multiple binary (two-entity) pairs.
  E.g. "REQ-12 and REQ-13 both derive from NEED-04" yields:
  "REQ-12, derives-from, NEED-04" and
  "REQ-13, derives-from, NEED-04."
- Relationships are directional: source → target.
- Do not add any commentary or preamble.

---Output Format---
<|>Source, relationship-type, Target<|>
{% endset %}

{% set user %}
Identify all relationships that are explicitly stated or directly
implied across the "Entities".

---Raw Chunk---
{{ raw_chunk }}

---Entities---
{{ entities }}
{% endset %}
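The `<|>Source, relationship-type, Target<|>` responses can be parsed into triples with a small helper. This is a sketch; like the comma-splitting in memgraph.py, it assumes entity names contain no commas:

```python
def parse_triples(raw: str) -> list[tuple[str, str, str]]:
    """Parse '<|>Source, relationship-type, Target<|>' output into
    (source, relation, target) triples, skipping malformed segments."""
    triples = []
    for segment in raw.split("<|>"):
        fields = [f.strip() for f in segment.split(",")]
        if len(fields) == 3 and all(fields):
            triples.append((fields[0], fields[1], fields[2]))
    return triples
```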
tuutrag/prompts/templates/relation_global_uni.j2 Jinja2
{#---
description: A template for global relationship identification
             tasks between entities across multiple sets.
             This template is identical to relation_global.j2.
             Both map to the same underlying prompt.
author: Pablo B., Marlon S.
---#}

{% set system %}
You are a relationship identification tool. Identify relationships
only from the sections labeled "Entities". Each "Relations",
"Raw Chunk" and "Artifact Summary" may be used for disambiguation
of their respective set only. Follow these instructions carefully:

---Input---
Input is provided across K numbered sets, each formatted as:
 [SET {n}]
 Entities: <comma-separated list>
 Relations: <"<|>"-comma separated list>
 Raw Chunk: <source text>
 Artifact Summary: <artifact-level summary>
- Treat each entity as scoped to its own set.
- If the input text contains the sequence <|>, replace it
  with [PIPE].

---Task---
- Identify meaningful relationships between entities across
  different sets. Do not extract relationships within the same set.
- Only extract relationships explicitly stated or directly implied.
- Decompose N-ary relationships into binary pairs.
- Relationships are directional: source → target.

---Output Format---
<|>Source, relationship-type, Target<|>

---Example---
Input:
[SET 1]
Entities: Dr. Evelyn Hart, Johns Hopkins, Case No. 2023-MED-0047, FDA
Relations: <|>Dr. Evelyn Hart, works at, Johns Hopkins<|>
           <|>Case No. 2023-MED-0047, reviewed by, FDA<|>
Raw Chunk: "…filed Case No. 2023-MED-0047 with the FDA citing
           adverse reactions to Compound XR-9B."
Artifact Summary: "FDA case filing regarding drug adverse events."
[SET 2]
Entities: Compound XR-9B, NIH, Phase III TriNeuro Trial
Relations: <|>Compound XR-9B, funded by, NIH Grant #R01NS112233<|>
           <|>Compound XR-9B, tested in, Phase III TriNeuro Trial<|>
Raw Chunk: "Compound XR-9B…was the subject of an adverse event report…
           The Phase III TriNeuro Trial was subsequently suspended."
Artifact Summary: "NIH-funded clinical trial documentation."
Output:
<|>Case No. 2023-MED-0047, concerns, Compound XR-9B<|>
<|>Dr. Evelyn Hart, reported, Compound XR-9B<|>
<|>FDA, suspended, Phase III TriNeuro Trial<|>
{% endset %}

{% set user %}
Identify all relationships that are explicitly stated or directly
implied across the "Entities".

---Raw Chunk---
{{ raw_chunk }}

---Entities---
{{ entities }}

---Relations---
{{ relations }}

---Artifact Summary---
{{ artifact_summary }}
{% endset %}

Knowledge Graph

All extracted triples (local, global, and universal) are merged and upserted into Memgraph using the Bolt protocol. Each entity becomes a :Entity node, each relationship a directed :RELATIONSHIP edge with a type property. The MemgraphConnection class parses <|>-delimited triples from JSONL and executes idempotent MERGE Cypher queries.

[Figure: example knowledge graph. REQ-042 → depends on → REQ-017; REQ-042 → verified by → TC-108; REQ-017 → tested by → TC-211; TC-108 → references → ICD-003; TC-211 → references → ICD-003; ICD-003 → part of → Thermal Subsystem. Node types: Requirement, Test Case, Support.]
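Merging the local, global, and universal passes before upserting can be as simple as an order-preserving de-duplication (a sketch; the actual pipeline merges JSONL outputs):

```python
def merge_triples(*passes: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Union of triples from several extraction passes, first occurrence wins."""
    seen: set[tuple[str, str, str]] = set()
    merged = []
    for triples in passes:
        for t in triples:
            if t not in seen:
                seen.add(t)
                merged.append(t)
    return merged
```

Because the Cypher upsert uses MERGE, duplicate triples would be harmless in the graph anyway; de-duplicating first simply avoids redundant round-trips.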
tuutrag/memgraph.py Python
import json
from typing import List
from exports import export
from neo4j import GraphDatabase


@export
class MemgraphConnection:
    obj_name = "MemgraphConnection"
    workspace = "kg"

    def __init__(self, port: int, frontend_port: int, host: str):
        self.driver = self.__connect(port, frontend_port, host)

    def __connect(self, port, frontend_port, host):
        URI = f"bolt://{host}:{int(port)}"
        driver = GraphDatabase.driver(URI, auth=("", ""))
        driver.verify_connectivity()
        print(f"Memgraph UI → http://{host}:{frontend_port}/")
        print(f"Memgraph Bolt → {URI}")
        return driver

    def read_data(self, file_path: str) -> List:
        parts = []
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                data = json.loads(line)
                content = data["text"].split("<|>")[1]
                part = [i.strip() for i in content.split(",")]
                if len(part) == 3:
                    parts.append(part)
        return parts

    def upsert(self, data: List) -> None:
        with self.driver.session() as session:
            for i, part in enumerate(data):
                session.run("""
                    MERGE (source:Entity {name: $source})
                    MERGE (target:Entity {name: $target})
                    MERGE (source)-[r:RELATIONSHIP]->(target)
                    SET r.type = $relationship
                """, source=part[0],
                     relationship=part[1],
                     target=part[2])
                print(f"upserted {i+1}/{len(data)}: "
                      f"{part[0]} -[{part[1]}]-> {part[2]}")

Key Modules

The codebase is organized into a compact Python package (tuutrag/) and a set of Jinja2 prompt templates. Below are the core building blocks that power every pipeline stage.

types.py

BranchChunk, BatchRequest, Message, RequestBody — typed dicts enforcing the data contract across stages.

data.py

DataManager — resolves file paths across raw/, interim/, processed/, and api/ directories. Singleton instance D.

prompt.py

create_batch_request() — assembles OpenAI Batch API payloads with system + user messages from rendered templates.

prompts/manager.py

load_prompt() — renders Jinja2 templates via make_module() and returns {"system": …, "user": …} string pairs.

utils.py

count_batch_request_tokens() — tiktoken-based token counting for batch-size splitting.

qdrant.py

VectorDB — wraps QdrantClient with connect(), create_collection(), and upsert().

tuutrag/types.py Python
from typing import Literal, TypedDict


class BranchChunk(TypedDict):
    uuid: str
    chunk: str
    path: str
    type: str


class TextContent(TypedDict):
    type: Literal["text"]
    text: str


class Message(TypedDict):
    role: Literal["system", "user"]
    content: str | list[TextContent]


class RequestBody(TypedDict):
    model: str
    messages: list[Message]
    stream: Literal[False]


class BatchRequest(TypedDict):
    custom_id: str
    method: Literal["POST"]
    url: Literal["/v1/chat/completions"]
    body: RequestBody
tuutrag/prompt.py Python
from tuutrag.types import BatchRequest


def create_batch_request(
    custom_id: str,
    model: str,
    **kwargs
) -> BatchRequest:
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": kwargs["system"]},
                {"role": "user",   "content": kwargs["user"]},
            ],
            "stream": False,
        },
    }
tuutrag/prompts/manager.py Python
from pathlib import Path
from jinja2 import Environment, FileSystemLoader, meta


TEMPLATES_DIR = Path(__file__).parent / "templates"


def load_prompt(template_name: str, **kwargs) -> dict[str, str]:
    env = Environment(
        loader=FileSystemLoader(str(TEMPLATES_DIR)),
        keep_trailing_newline=True,
        comment_start_string="{#---",
        comment_end_string="---#}",
    )
    template = env.get_template(template_name)
    module = template.make_module(vars=kwargs)
    return {
        "system": str(module.system).strip(),
        "user":   str(module.user).strip(),
    }
tuutrag/utils.py Python
from tiktoken.core import Encoding
from tuutrag.types import BatchRequest


def count_batch_request_tokens(
    enc: Encoding, payload: BatchRequest
) -> int:
    total = 0
    for message in payload["body"]["messages"]:
        content = message["content"]
        if isinstance(content, str):
            total += len(enc.encode(content))
        elif isinstance(content, list):
            for block in content:
                if block.get("type") == "text":
                    total += len(enc.encode(block["text"]))
    return total

Batch Processing

Every LLM inference in the pipeline — leaf chunking, summarization, entity extraction, and relationship extraction — is submitted through the OpenAI Batch API. Requests are assembled via create_batch_request(), token-counted with count_batch_request_tokens(), and split into JSONL files respecting per-batch token limits. This approach provides 50% cost savings over synchronous completions while enabling parallelized processing of hundreds of thousands of chunks. Embeddings are handled separately through Gemini (gemini-embedding-001), batched with sliding-window RPM/TPM rate limiting.
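The token-aware splitting amounts to greedy bin packing. A sketch, where `count_tokens` stands in for count_batch_request_tokens and `limit` is the per-batch token ceiling; the function name is illustrative:

```python
def split_batches(requests: list, count_tokens, limit: int) -> list[list]:
    """Greedily pack requests into batches whose summed token counts stay
    at or below `limit`; an oversized request still gets its own batch."""
    batches, current, used = [], [], 0
    for req in requests:
        n = count_tokens(req)
        if current and used + n > limit:
            batches.append(current)  # flush the full batch
            current, used = [], 0
        current.append(req)
        used += n
    if current:
        batches.append(current)
    return batches
```

Each resulting batch is then serialized as one JSONL file for submission.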

Notebooks

The pipeline is orchestrated across six Jupyter notebooks, each handling a distinct phase. They share the tuutrag/ module layer for prompts, types, and data management.