Keep only 4.1 (first reply contains question) as automated eval. Multi-turn cases (4.2–4.5) are non-deterministic and tested manually with results tracked in Langfuse.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
20 lines
698 B
YAML
# Journey V2 eval test cases — Step 4
#
# Only case 4.1 is kept as an automated eval. Cases 4.2–4.5 (multi-turn
# conversations that expect the LLM to produce a complete AgentConfig)
# are non-deterministic and tested manually — results tracked in Langfuse.
#
# Assertion keys:
# expect_question: true → first reply must contain "?"

- id: "4.1"
  description: "Journey start explores directory, first reply contains a question"
  directory: "/test/emails"
  data_types: ["tasks", "notes", "timelines"]
  directory_files:
    - path: "/test/emails/outlook_export_2024.html"
      content_file: "email_action.html"
  user_messages: []
  score_name: "journey.start"
  expect_question: true
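
The `expect_question` assertion described in the header comment can be sketched roughly as follows. This is only an illustration, assuming each case is loaded into a plain dict; `check_case` is a hypothetical helper and not the project's actual eval runner, whose interface is not shown here.

```python
def check_case(case: dict, first_reply: str) -> bool:
    """Return True when the first reply satisfies the case's assertion keys.

    Hypothetical helper: only the `expect_question` key from the
    header comment is handled.
    """
    # expect_question: true -> the first reply must contain "?"
    if case.get("expect_question"):
        return "?" in first_reply
    # No assertion key set, so there is nothing to fail on.
    return True


# Case 4.1 reduced to the fields this check needs.
case_4_1 = {"id": "4.1", "score_name": "journey.start", "expect_question": True}

passed = check_case(case_4_1, "Which of these emails should become tasks?")
failed = check_case(case_4_1, "I found one HTML export in the directory.")
```

A substring check like this is deterministic, which is presumably why 4.1 can stay automated while the multi-turn cases cannot.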