feat(batch-agent): add journey eval to E2E harness

- journey_runner.py: orchestrates journey start → simulated user messages → template extraction → LLM judge scoring
- config.py: JourneyFixture dataclass with user_messages and expected_template_criteria, discover_journey_fixtures()
- langfuse_eval.py: sync_journey_fixture_to_dataset()
- cli.py: new 'journey' subcommand (python -m eval journey) with --fixture, --models, --judge-model flags
- fixtures/journey_invoice_setup.yaml: example journey fixture with 4 user messages and 8 quality criteria
2026-03-23 23:16:41 +01:00
parent d856dfd28c
commit 63fa119543
5 changed files with 643 additions and 11 deletions
--- a/services/batch-agent/eval/fixtures/journey_invoice_setup.yaml
+++ b/services/batch-agent/eval/fixtures/journey_invoice_setup.yaml
@@ -0,0 +1,46 @@
+# Journey Fixture: journey-invoice-setup
+# Tests that the journey chatbot correctly builds a prompt_template
+# for extracting tasks and notes from Italian invoices and meeting notes.
+
+type: journey
+name: journey-invoice-setup
+description: >
+  Test the journey chatbot's ability to explore a directory of Italian
+  invoices and meeting notes, ask relevant questions, and produce a
+  well-structured prompt_template for data extraction.
+
+directory: sample_files/invoices
+data_types: [tasks, notes, timelines]
+
+# Simulated user responses (the journey starts with the LLM exploring
+# the directory and asking its first question)
+user_messages:
+  - >
+    I want to extract action items from invoices and meeting notes.
+    The invoices are in Italian and contain work descriptions with
+    deadlines. Meeting notes have action items with checkboxes.
+  - >
+    Yes, map Italian priority keywords: "URGENTE" and "ALTA PRIORITÀ"
+    should be high priority, "media priorità" is medium, "bassa priorità"
+    is low. Items marked with [x] are already completed.
+  - >
+    For notes, I want meeting summaries with the full content including
+    decisions and attendees. For timelines, extract deadlines and
+    scheduled meeting dates.
+  - >
+    That's everything I need. Please generate the template.
+
+# Criteria the generated prompt_template must satisfy
+# Each is scored 0-1 by an LLM judge
+expected_template_criteria:
+  - "Mentions creating tasks from action items and work descriptions"
+  - "Includes Italian priority keyword mapping (URGENTE→high, media priorità→medium, bassa priorità→low)"
+  - "Handles completed items marked with [x] as status done"
+  - "Mentions creating notes from meeting summaries"
+  - "Mentions extracting timeline events from deadlines and meeting dates"
+  - "Sets isAiSuggested=1 on all created records"
+  - "Does NOT include projectId assignment logic"
+  - "Uses camelCase field names (title, status, priority, dueDate, content)"
+
+# Models to test (empty = use CLI --models default)
+models: []
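
For readers without the rest of the commit at hand, here is a minimal sketch of how a fixture with these keys might be held in memory and how per-criterion judge scores might be combined. Only the names JourneyFixture, user_messages, and expected_template_criteria come from the commit message; the remaining fields, the from_dict() constructor, and the overall_score() helper are hypothetical, and averaging the 0-1 scores is an assumption, since the fixture only states that each criterion is scored 0-1. The real config.py and journey_runner.py may differ.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class JourneyFixture:
    """Hypothetical shape; the actual dataclass in config.py may differ."""

    name: str
    directory: str
    data_types: list[str]
    user_messages: list[str]
    expected_template_criteria: list[str]
    description: str = ""
    models: list[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "JourneyFixture":
        # Hypothetical constructor: takes the mapping a YAML loader would return.
        if raw.get("type") != "journey":
            raise ValueError(f"not a journey fixture: type={raw.get('type')!r}")
        return cls(
            name=raw["name"],
            directory=raw["directory"],
            data_types=list(raw.get("data_types", [])),
            # Folded scalars (">") end with a newline, so strip each message.
            user_messages=[m.strip() for m in raw["user_messages"]],
            expected_template_criteria=list(raw["expected_template_criteria"]),
            description=raw.get("description", "").strip(),
            # An empty list means "fall back to the CLI --models default".
            models=list(raw.get("models") or []),
        )


def overall_score(scores: dict[str, float]) -> float:
    """Average per-criterion judge scores (assumed aggregation, not confirmed)."""
    if not scores:
        raise ValueError("no criterion scores to aggregate")
    for criterion, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {criterion!r}: {score}")
    return mean(scores.values())


# Usage with a trimmed-down copy of the fixture above.
raw = {
    "type": "journey",
    "name": "journey-invoice-setup",
    "directory": "sample_files/invoices",
    "data_types": ["tasks", "notes", "timelines"],
    "user_messages": ["That's everything I need. Please generate the template.\n"],
    "expected_template_criteria": ["Mentions creating notes from meeting summaries"],
    "models": [],
}
fixture = JourneyFixture.from_dict(raw)
```

Rejecting any mapping whose type is not journey keeps other fixture kinds in the same directory from being loaded by mistake, and range-checking each score surfaces a misbehaving judge early instead of skewing the average.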