Metric
A metric used to judge the response given by an AI Agent. To make evaluation scale, the judge is itself an LLM.
Full Specification
judge: GenaiClient
template: string
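A minimal sketch of the shape this specification implies, assuming a Python API; the dataclass below is an illustrative stand-in, not the library's actual class:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Metric:
    """Illustrative stand-in mirroring the specification above."""
    judge: Any     # a GenaiClient instance that runs the judging prompt
    template: str  # prompt template with {{actual_output}}, {{input}}, {{expected_output}}
```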
judge
The GenaiClient that will judge the response according to the template described below.
template
Metric templates can contain three fields:
- actual_output - Replaced with the answer produced by the agent, node, or other component under evaluation.
- input - Replaced with the input, extracted from the dataset example.
- expected_output - An optional value, also extracted from the example.
Each evaluation will produce the following outputs:
- label - A single-word verbal equivalent of the score (e.g., 'Good', 'Bad', 'Hallucination'), based on the instructions provided.
- score - The numerical value reflecting the quality of the evaluation, assigned as per the instructions.
- explanation - A verbal explanation for the score and label given.
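Concretely, one judged example might come back as a record like this (the field names are from the list above; the values and serialization are illustrative assumptions):

```python
# Illustrative judge verdict for a single example; keys match the outputs above.
result = {
    "label": "Good",  # single-word verbal equivalent of the score
    "score": 2,       # numerical quality value, assigned per the instructions
    "explanation": "The answer addresses the question and agrees with the reference.",
}
```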
An example template:
Your job is to judge whether this sentence:
{{actual_output}}
is (1) a good answer to the following question:
{{input}}
and (2) whether it contradicts this reference answer:
{{expected_output}}
Give it a score between 0 and 2 (one point for each criterion), explain your reasoning behind the score, and indicate whether the answer is Horrible (0 points), Bad (1 point), or Good (2 points).
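The {{...}} placeholders suggest Jinja-style substitution. As a sketch, here is how the template above could be rendered for one dataset example, assuming Jinja2 semantics (the engine the library actually uses is an assumption):

```python
from jinja2 import Template  # assumes Jinja2-style {{...}} placeholders

# The template text from the example above, condensed into one string.
TEMPLATE = (
    "Your job is to judge whether this sentence:\n{{ actual_output }}\n"
    "is (1) a good answer to the following question:\n{{ input }}\n"
    "and (2) whether it contradicts this reference answer:\n{{ expected_output }}\n"
    "Give it a score between 0 and 2 (one point for each criterion)..."
)

prompt = Template(TEMPLATE).render(
    actual_output="Paris is the capital of France.",
    input="What is the capital of France?",
    expected_output="The capital of France is Paris.",
)
print(prompt)  # the fully substituted judging prompt handed to the judge LLM
```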