refactor(eval): 3-mode eval harness (step1/step2/full) with Langfuse fixes

- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
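
For readers skimming this message, the mode dispatch described above can be pictured roughly as follows. This is only an illustrative sketch, not the code in this commit: the names EvalMode, _run_step1, _run_step2 and _run_full come from the message, while the function signatures, the `run` entry point, the fixture argument and the returned dictionaries are assumptions made for the example.

```python
from enum import Enum
from typing import Any, Callable, Dict


class EvalMode(str, Enum):
    """The three eval modes named in the commit message."""
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction only
    FULL = "full"    # classification followed by extraction


# Hypothetical handlers: the real runner's signatures and return
# values are not shown in this message, so these are placeholders.
def _run_step1(fixture: Dict[str, Any]) -> Dict[str, Any]:
    return {"mode": "step1", "classifications": []}


def _run_step2(fixture: Dict[str, Any]) -> Dict[str, Any]:
    return {"mode": "step2", "extraction": {}}


def _run_full(fixture: Dict[str, Any]) -> Dict[str, Any]:
    # Full mode carries both outputs, mirroring the mode-aware
    # dataset items described for Langfuse.
    step1 = _run_step1(fixture)
    step2 = _run_step2(fixture)
    return {
        "mode": "full",
        "classifications": step1["classifications"],
        "extraction": step2["extraction"],
    }


_HANDLERS: Dict[EvalMode, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


def run(mode: str, fixture: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch to the handler for the given mode string.

    EvalMode(mode) raises ValueError for an unknown mode, so a bad
    --mode value fails before any handler runs.
    """
    return _HANDLERS[EvalMode(mode)](fixture)
```

With a table like this, adding a fourth mode would mean one new enum member and one new handler entry, and an unrecognised mode string is rejected up front rather than silently falling through.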
2026-03-24 16:18:51 +01:00
parent 63fa119543
commit d3f7099d93
13 changed files with 1409 additions and 439 deletions
--- a/services/batch-agent/eval/fixtures/journey_invoice_setup.yaml
+++ b/services/batch-agent/eval/fixtures/journey_invoice_setup.yaml
@@ -1,43 +1,25 @@
 # Journey Fixture: journey-invoice-setup
-# Tests that the journey chatbot correctly builds a prompt_template
-# for extracting tasks and notes from Italian invoices and meeting notes.
+# Used by `python -m eval interactive` for human-in-the-loop testing
+# of the journey chatbot's prompt-building conversation.

 type: journey
 name: journey-invoice-setup
 description: >
-  Test the journey chatbot's ability to explore a directory of Italian
-  invoices and meeting notes, ask relevant questions, and produce a
-  well-structured prompt_template for data extraction.
+  Interactive test for the journey chatbot — explore a directory of
+  Italian invoices and meeting notes, answer the chatbot's questions,
+  and verify it produces a well-structured prompt_template for data
+  extraction.

 directory: sample_files/invoices
-data_types: [tasks, notes, timelines]
-
-# Simulated user responses (the journey starts with the LLM exploring
-# the directory and asking its first question)
-user_messages:
-  - >
-    I want to extract action items from invoices and meeting notes.
-    The invoices are in Italian and contain work descriptions with
-    deadlines. Meeting notes have action items with checkboxes.
-  - >
-    Yes, map Italian priority keywords: "URGENTE" and "ALTA PRIORITÀ"
-    should be high priority, "media priorità" is medium, "bassa priorità"
-    is low. Items marked with [x] are already completed.
-  - >
-    For notes, I want meeting summaries with the full content including
-    decisions and attendees. For timelines, extract deadlines and
-    scheduled meeting dates.
-  - >
-    That's everything I need. Please generate the template.
+data_types: [tasks, notes, timelines, projects]

 # Criteria the generated prompt_template must satisfy
 # Each is scored 0-1 by an LLM judge
 expected_template_criteria:
  - "Mentions creating tasks from action items and work descriptions"
-  - "Includes Italian priority keyword mapping (URGENTE→high, media priorità→medium, bassa priorità→low)"
-  - "Handles completed items marked with [x] as status done"
  - "Mentions creating notes from meeting summaries"
  - "Mentions extracting timeline events from deadlines and meeting dates"
+  - "Mentions creating projects from relevant information"
  - "Sets isAiSuggested=1 on all created records"
  - "Does NOT include projectId assignment logic"
  - "Uses camelCase field names (title, status, priority, dueDate, content)"