Eval
An evaluation is not a complex thing:
1. You have a sample question.
2. You ask that question to your agent.
3. You then use some metric or rubric to decide whether the answer is good enough or not, and why.
Explanations of (1) and (3) are provided below.
Full Specification
dataset: string
metrics:
- Metric
- ...
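For orientation, a minimal sketch of a filled-in spec follows. The dataset name here is hypothetical, and the exact shape of each entry under metrics follows the Metric specification, which this sketch does not reproduce:
dataset: qa_sample_questions   # hypothetical dataset reference
metrics:
- answer_quality               # hypothetical Metric entry; see the Metric specification for its full shape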
dataset
The set of questions that will be asked of the agent, node or workflow in order to evaluate its performance.
metrics
The set of metrics to evaluate the answers with. If none are given, the agent, workflow or node will still be evaluated, but its answers will not be judged.
Metric templates can contain three fields:
- actual_output - Will be replaced with the answer of the agent, node or other.
- input - Will be replaced by the input, extracted from the dataset.
- expected_output - An optional value that would also be extracted from the example.
They will produce the following outputs:
- label - A single-word verbal equivalent of the score (e.g., 'Good', 'Bad', 'Hallucination'). Base this value on the instructions provided.
- score - The numerical value reflecting the quality of the evaluation, assigned as per the instructions.
- explanation - A verbal explanation for the score and label given.
For example:
Your job is to judge whether this sentence:
{{actual_output}}
(1) is a good answer to the following question:
{{input}}
and (2) whether it contradicts this reference answer:
{{expected_output}}
Give it a score between 0 and 2 (one point for each criterion), explain your reasoning behind the score, and indicate whether it is Horrible (0 points), Bad (1 point) or Good (2 points).
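Applied to a single question, a metric built from that template fills the three output fields described above. Purely as an illustration, the values below are made up:
label: Good
score: 2
explanation: The sentence answers the question and does not contradict the reference answer, so it earns both points.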