- journey_runner.py: orchestrates journey start → simulated user messages → template extraction → LLM judge scoring
- config.py: JourneyFixture dataclass with user_messages and expected_template_criteria, discover_journey_fixtures()
- langfuse_eval.py: sync_journey_fixture_to_dataset()
- cli.py: new 'journey' subcommand (python -m eval journey) with --fixture, --models, --judge-model flags
- fixtures/journey_invoice_setup.yaml: example journey fixture with 4 user messages and 8 quality criteria
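
The flow that journey_runner.py orchestrates can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `run_journey`, `JourneyResult`, and the three injected callables (`send_message`, `extract_template`, `judge`) are hypothetical names, and averaging the per-criterion scores is an assumption about how they are combined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class JourneyResult:
    template: str
    scores: Dict[str, float]  # criterion -> judge score in [0, 1]

    @property
    def mean_score(self) -> float:
        # Assumption: the overall score is the plain average of the criteria.
        return sum(self.scores.values()) / len(self.scores) if self.scores else 0.0


def run_journey(
    user_messages: List[str],
    criteria: List[str],
    send_message: Callable[[Optional[str]], str],   # hypothetical chatbot call
    extract_template: Callable[[List[str]], str],   # hypothetical extractor
    judge: Callable[[str, str], float],             # hypothetical LLM judge
) -> JourneyResult:
    # Journey start: the chatbot speaks first (None = no user input yet).
    replies = [send_message(None)]
    # Replay the simulated user messages in order.
    for message in user_messages:
        replies.append(send_message(message))
    # Pull the generated prompt_template out of the conversation.
    template = extract_template(replies)
    # Score each criterion, clamping to the documented 0-1 range.
    scores = {c: min(1.0, max(0.0, judge(template, c))) for c in criteria}
    return JourneyResult(template=template, scores=scores)
```

Passing the chatbot, extractor, and judge in as callables keeps the sketch testable with stubs; the real runner presumably calls the models selected via --models and --judge-model.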
47 lines
2.0 KiB
YAML
# Journey Fixture: journey-invoice-setup
# Tests that the journey chatbot correctly builds a prompt_template
# for extracting tasks and notes from Italian invoices and meeting notes.

type: journey
name: journey-invoice-setup
description: >
  Test the journey chatbot's ability to explore a directory of Italian
  invoices and meeting notes, ask relevant questions, and produce a
  well-structured prompt_template for data extraction.

directory: sample_files/invoices
data_types: [tasks, notes, timelines]

# Simulated user responses (the journey starts with the LLM exploring
# the directory and asking its first question)
user_messages:
  - >
    I want to extract action items from invoices and meeting notes.
    The invoices are in Italian and contain work descriptions with
    deadlines. Meeting notes have action items with checkboxes.
  - >
    Yes, map Italian priority keywords: "URGENTE" and "ALTA PRIORITÀ"
    should be high priority, "media priorità" is medium, "bassa priorità"
    is low. Items marked with [x] are already completed.
  - >
    For notes, I want meeting summaries with the full content including
    decisions and attendees. For timelines, extract deadlines and
    scheduled meeting dates.
  - >
    That's everything I need. Please generate the template.

# Criteria the generated prompt_template must satisfy
# Each is scored 0-1 by an LLM judge
expected_template_criteria:
  - "Mentions creating tasks from action items and work descriptions"
  - "Includes Italian priority keyword mapping (URGENTE→high, media priorità→medium, bassa priorità→low)"
  - "Handles completed items marked with [x] as status done"
  - "Mentions creating notes from meeting summaries"
  - "Mentions extracting timeline events from deadlines and meeting dates"
  - "Sets isAiSuggested=1 on all created records"
  - "Does NOT include projectId assignment logic"
  - "Uses camelCase field names (title, status, priority, dueDate, content)"

# Models to test (empty = use CLI --models default)
models: []
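
A fixture like this maps naturally onto the JourneyFixture dataclass the commit adds to config.py. The sketch below is an assumption about its shape: only `user_messages` and `expected_template_criteria` are named in the commit message, the other fields simply mirror the YAML keys, and `fixture_from_dict` and `resolve_models` are hypothetical helpers. The input dict is assumed to come from a YAML parser such as PyYAML's `yaml.safe_load`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class JourneyFixture:
    name: str
    directory: str
    user_messages: List[str]
    expected_template_criteria: List[str]
    description: str = ""
    data_types: List[str] = field(default_factory=list)
    models: List[str] = field(default_factory=list)


def fixture_from_dict(data: Dict[str, Any]) -> JourneyFixture:
    # Hypothetical loader: reject files that are not journey fixtures.
    if data.get("type") != "journey":
        raise ValueError(f"not a journey fixture: type={data.get('type')!r}")
    return JourneyFixture(
        name=data["name"],
        directory=data["directory"],
        # Folded scalars (">") end with a newline, so strip each message.
        user_messages=[m.strip() for m in data["user_messages"]],
        expected_template_criteria=list(data["expected_template_criteria"]),
        description=data.get("description", "").strip(),
        data_types=list(data.get("data_types", [])),
        models=list(data.get("models") or []),
    )


def resolve_models(fixture: JourneyFixture, cli_models: List[str]) -> List[str]:
    # An empty list in the fixture means "use the CLI --models default".
    return fixture.models or list(cli_models)
```

With `models: []` as in the fixture above, `resolve_models` falls back to whatever list the CLI supplies; a non-empty list in the fixture takes precedence.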