Teaching to Understand, Understanding to Teach: RAG for Requirements Traceability

1 Jet Propulsion Laboratory/California Institute of Technology; 2 Dominican University; 3 University of Illinois Urbana-Champaign; 4 J Sterling Morton East High School
WORK IN PROGRESS

Abstract

This section is under work.

Background

This report details the research progress for the project titled “Teaching to Understand, Understanding to Teach: Retrieval Augmented Generation for Requirements Traceability.” The research was initially conducted for the Jet Propulsion Laboratory (“JPL”), California Institute of Technology (“Caltech”), under the sponsorship of the National Aeronautics and Space Administration (“NASA”).

Definitions

| Term | Definition |
| --- | --- |
| RAG | Retrieval-Augmented Generation; a technique that enhances LLMs with external data. |
| Embeddings | Vector representations of text segments used for semantic search. |
| Information Supply | The existing, ingested corpus of data used by the system. |
| Traceability | The ability to link requirements to their corresponding code implementation. |
| Artifact(s) | Pieces of information pertaining to the system's information supply. |
| Requirement | Formally agreed-upon "shall" statement defining a system condition, capability, or constraint. |
| Test Case | Detailed, step-by-step series of instructions on an end product, used to verify and validate a specific requisite or requirement. |
| Entity | A named object or thing acting individually. |
| Relationship | A link between two entities. |
| Branch Chunk | Fixed-size, overlapping subset of tokens from an artifact. |
| Leaf Chunk | Variable-size, non-overlapping, semantic subset of tokens from a branch chunk. |
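
The two chunk definitions above can be illustrated with a minimal sketch. It splits an artifact into fixed-size, overlapping branch chunks, then splits one branch chunk into variable-size, non-overlapping leaf chunks. Whitespace tokenization, the `size` and `overlap` parameters, the function names, and the use of sentence-ending punctuation as the "semantic" boundary are all assumptions made for illustration; the definitions do not specify a tokenizer, chunk sizes, or how semantic boundaries are detected.

```python
from typing import List


def branch_chunks(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Fixed-size, overlapping windows over an artifact's tokens.

    Assumption: the final window may be shorter than `size` when the
    token count does not line up with the stride.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # stride between consecutive window starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the artifact
    return chunks


def leaf_chunks(branch: List[str]) -> List[List[str]]:
    """Variable-size, non-overlapping pieces of one branch chunk.

    Assumption: a token ending in '.', '!' or '?' closes a leaf; a real
    system would likely use a proper semantic or sentence splitter.
    """
    leaves, current = [], []
    for token in branch:
        current.append(token)
        if token.endswith((".", "!", "?")):
            leaves.append(current)
            current = []
    if current:  # trailing tokens with no closing punctuation
        leaves.append(current)
    return leaves


if __name__ == "__main__":
    text = "The system shall log errors. The system shall retry once. Logs are kept."
    for branch in branch_chunks(text.split(), size=6, overlap=2):
        print(branch, "->", leaf_chunks(branch))
```

Consecutive branch chunks share `overlap` tokens, while the leaves of any one branch chunk partition it exactly, which mirrors the overlapping versus non-overlapping distinction in the definitions.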

Notation

| Symbol | Description |
| --- | --- |
| \( N \) | Sample size of the set under consideration |
| \( A \) | Artifact |
| \( R \) | Requirement |
| \( T \) | Test Case |
| \( E \), \( e \) | Entity set; entity |
| \( R \), \( r \) | Relation set; relation |
| \( B \), \( b \) | Branch chunk; leaf chunk |

Entry Point

The entry point is the initial state of our tracing system, whether the tool is being used for the first time or repeatedly. The goal of the system is to suggest traceability mappings from an individual requirement (\(R\)) or test case (\(T\)) to trace sets of \(R\) or \(T\), respectively. Two types of input govern the input space: traceable artifacts and contextual artifacts. In our work, artifacts are pieces of information pertaining to the information supply, that is, the corpus of data that the system knows of. A traceable artifact is an information artifact that represents system intent or system verification and is eligible to participate as a source or target in a traceability relationship; in this system, requirements and test cases play this role. Contextual artifacts are support artifacts, meaning artifacts that are neither requirements nor test cases, which add auxiliary information to the information supply to support context retrieval. Unlike traceable artifacts, whose purpose is to appear in a source or target set, contextual artifacts are non-traceable units.
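
The distinction between traceable and contextual artifacts can be sketched as a small data model. This is an illustrative sketch under stated assumptions, not the system's actual schema: the class names, the `artifact_id` and `text` fields, and the `eligible_targets` helper are hypothetical. The three type names follow the Figure 1 caption (Support, Requirement, Test Case). The helper only filters out artifacts that cannot take part in a trace; how the candidate pool is narrowed further to a trace set is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ArtifactType(Enum):
    SUPPORT = "Support"          # contextual artifact, never traced
    REQUIREMENT = "Requirement"  # traceable artifact
    TEST_CASE = "Test Case"      # traceable artifact


# Only requirements and test cases may be a trace source or target.
TRACEABLE_TYPES = frozenset({ArtifactType.REQUIREMENT, ArtifactType.TEST_CASE})


@dataclass(frozen=True)
class Artifact:
    artifact_id: str    # hypothetical identifier field
    kind: ArtifactType  # the type selected at submission
    text: str

    @property
    def is_traceable(self) -> bool:
        return self.kind in TRACEABLE_TYPES


def eligible_targets(source: Artifact, supply: List[Artifact]) -> List[Artifact]:
    """Artifacts in the information supply that could be trace targets.

    Contextual (Support) artifacts are excluded, as is the source itself.
    """
    if not source.is_traceable:
        raise ValueError("a contextual artifact cannot be a trace source")
    return [
        a for a in supply
        if a.is_traceable and a.artifact_id != source.artifact_id
    ]


if __name__ == "__main__":
    supply = [
        Artifact("R-1", ArtifactType.REQUIREMENT, "The system shall log errors."),
        Artifact("T-1", ArtifactType.TEST_CASE, "Trigger a fault; check the log."),
        Artifact("S-1", ArtifactType.SUPPORT, "Logging design notes."),
    ]
    print([a.artifact_id for a in eligible_targets(supply[0], supply)])
```

Contextual artifacts stay in the supply so that they remain available for context retrieval; they are simply never returned as a source or target.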

Artifact Submission

[Figure: flowchart of the entry point]
Figure 1. An artifact of type Support, Requirement, or Test Case is ingested. The [+] Type variable is the type selection, supplied along with the artifact itself.