- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
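The mode dispatch and the empty-`field_scores` guard described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's code: only the names `EvalMode`, `_run_step1`, `_run_step2`, `_run_full`, `field_scores`, and `step1_results` come from the commit message; the function bodies, the `run` and `field_accuracy` helpers, and the returned dictionary shapes are hypothetical placeholders.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class EvalMode(str, Enum):
    """Evaluation modes named in the commit message."""
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction only
    FULL = "full"    # classification and extraction


# Placeholder runners: the real ones call the models; these only
# return hypothetical result shapes so the dispatch can be exercised.
def _run_step1(fixture: dict) -> dict:
    return {"mode": "step1", "classifications": [], "field_scores": {}}


def _run_step2(fixture: dict) -> dict:
    return {"mode": "step2", "field_scores": {"title": 1.0, "dueDate": 0.5}}


def _run_full(fixture: dict) -> dict:
    return {
        "mode": "full",
        "step1_results": _run_step1(fixture),
        "field_scores": {"title": 1.0},
    }


_DISPATCH: Dict[EvalMode, Callable[[dict], dict]] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


def run(mode: str, fixture: dict) -> dict:
    """Validate the mode string (ValueError if unknown) and dispatch."""
    return _DISPATCH[EvalMode(mode)](fixture)


def field_accuracy(field_scores: Dict[str, float]) -> Optional[float]:
    """Return the mean field score, or None when there is nothing to score.

    Returning None instead of 0 lets the caller skip posting a
    misleading field_accuracy=0 for step1 runs.
    """
    if not field_scores:
        return None
    return sum(field_scores.values()) / len(field_scores)
```

With this shape, a caller would post the `field_accuracy` score only when the helper returns a number, which matches the intent of skipping the score for step1.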
29 lines
1.1 KiB
YAML
# Journey Fixture: journey-invoice-setup
# Used by `python -m eval interactive` for human-in-the-loop testing
# of the journey chatbot's prompt-building conversation.

type: journey
name: journey-invoice-setup
description: >
  Interactive test for the journey chatbot: explore a directory of
  Italian invoices and meeting notes, answer the chatbot's questions,
  and verify it produces a well-structured prompt_template for data
  extraction.

directory: sample_files/invoices
data_types: [tasks, notes, timelines, projects]

# Criteria the generated prompt_template must satisfy
# Each is scored 0-1 by an LLM judge
expected_template_criteria:
  - "Mentions creating tasks from action items and work descriptions"
  - "Mentions creating notes from meeting summaries"
  - "Mentions extracting timeline events from deadlines and meeting dates"
  - "Mentions creating projects from relevant information"
  - "Sets isAiSuggested=1 on all created records"
  - "Does NOT include projectId assignment logic"
  - "Uses camelCase field names (title, status, priority, dueDate, content)"

# Models to test (empty = use CLI --models default)
models: []
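As a rough illustration of how a fixture like this might be consumed once parsed into a dictionary, the sketch below checks the keys shown above, resolves the empty `models` list to a CLI default, and combines per-criterion judge scores. All three helper names are hypothetical, and averaging the 0-1 scores is an assumption: the fixture only states that each criterion is scored 0-1 by an LLM judge, not how the scores are aggregated.

```python
from statistics import mean
from typing import Dict, List

# Keys present in the journey fixture above.
REQUIRED_KEYS = (
    "type",
    "name",
    "directory",
    "data_types",
    "expected_template_criteria",
)


def validate_journey_fixture(fixture: dict) -> List[str]:
    """Return a list of problems; an empty list means the fixture looks valid."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in fixture]
    if fixture.get("type") != "journey":
        problems.append("type must be 'journey'")
    if not fixture.get("expected_template_criteria"):
        problems.append("expected_template_criteria must be a non-empty list")
    return problems


def resolve_models(fixture: dict, cli_default: List[str]) -> List[str]:
    """An empty or missing `models` list falls back to the CLI default."""
    return list(fixture.get("models") or cli_default)


def aggregate_scores(scores: Dict[str, float]) -> float:
    """Average per-criterion judge scores (assumed aggregation), each in [0, 1]."""
    if not scores:
        raise ValueError("no criterion scores to aggregate")
    for criterion, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {criterion!r}: {score}")
    return mean(scores.values())
```

Keeping validation separate from scoring means a malformed fixture can be rejected before any model or judge call is made.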