Concepts#

Data#

Reference_df

Testing LLMs involves preparing the following data for your use case:

Inputs to the LLM. Depending on the task at hand, these inputs are likely formatted to follow a prompt template.
Reference Outputs: these are your baseline outputs, which are optional in Arthur Bench but recommended to get a comprehensive understanding of your model’s performance relative to its expected outputs. These reference outputs would likely be either ground truth responses to the inputs, or could be outputs from a baseline LLM that you are evaluating against.
Candidate Outputs: these are the outputs from your candidate LLM that you are scoring.
Context: contextual information used to produce the candidate output, e.g. for retrieval-augmented Question & Answering tasks.

As an example, consider the task of Question & Answering about specific documents:

Input: “What war was referred to in the Gettysburg Address?”
Reference Output: American Civil War
Candidate Output: The war referenced in the Gettysburg Address is the American Civil War
Context: (Wikipedia) “The Gettysburg Address is a speech that U.S. President Abraham Lincoln delivered during the American Civil War at the dedication of the Soldiers’ National Cemetery, now known as Gettysburg National Cemetery, in Gettysburg, Pennsylvania on the afternoon of November 19, 1863, four and a half months after the Union armies defeated Confederate forces in the Battle of Gettysburg, the Civil War’s deadliest battle.”

Testing#

test_suites_runs

Test Suites#

A Test Suite stores the input & reference output data for your testing use case along with a scorer.

For example, for a summarization use case, your test suite could be created with:

the documents to summarize
baseline summaries as reference outputs to evaluate against
the SummaryQuality scorer

Test suites allow you to save and reuse your evaluation datasets over time with a consistent scorer to help you understand what drives changes in performance.

To view how to create test suites from various data formats, view our creating test suites guide

Test runs#

When a test suite is run, its scorer evaluates the candidate outputs provided in the run and assigns a score to each test case.

To run your test suite on candidate data, pass the data to the run() function of your test suite, along with any additional metadata you want to be logged for that run. To view the metadata you can save with your test runs, see the SDK docs

To view how to create test runs from various data formats, visit our test suites guide