- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
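The mode-based dispatch described in the commit message could look roughly like the sketch below. Only the names EvalMode, _run_step1, _run_step2 and _run_full come from the commit message; their signatures, the return shapes, the run_fixture helper and the _DISPATCH table are assumptions made for illustration, not the actual runner code.

```python
from enum import Enum
from typing import Callable, Dict


class EvalMode(str, Enum):
    """Evaluation modes named in the commit message."""
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction only
    FULL = "full"    # classification followed by extraction


# Placeholder step functions: the real ones would call the model.
# Signatures and return shapes are hypothetical.
def _run_step1(fixture: dict) -> dict:
    return {"mode": "step1", "classifications": []}


def _run_step2(fixture: dict) -> dict:
    return {"mode": "step2", "extraction": {}}


def _run_full(fixture: dict) -> dict:
    step1 = _run_step1(fixture)
    step2 = _run_step2(fixture)
    return {
        "mode": "full",
        "classifications": step1["classifications"],
        "extraction": step2["extraction"],
    }


# Hypothetical dispatch table keyed by mode.
_DISPATCH: Dict[EvalMode, Callable[[dict], dict]] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


def run_fixture(fixture: dict) -> dict:
    """Pick the step function from the fixture's `mode` field."""
    mode = EvalMode(fixture["mode"])  # raises ValueError for unknown modes
    return _DISPATCH[mode](fixture)
```

A table keyed by the enum keeps unknown modes from slipping through silently: a fixture with an unsupported mode value fails at EvalMode(...) before any step runs.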
82 lines
2.6 KiB
YAML
# Fixture: process-invoices (step2)
# Tests _PROCESSING_SYSTEM_PROMPT: data extraction & tool calling.
# The classification step is skipped; prompt variables are injected directly.

name: process-invoices
mode: step2
description: >
  Test data extraction from Italian freelance invoices.
  Verifies correct record creation via tool calls with the right
  fields, priorities, and status values.

directory: sample_files/invoices
data_types: [tasks, notes, timelines]
file_extensions: [txt, md]

# ── Step-2 prompt variables ──────────────────────────────────────
existing_context: |
  Existing tasks:
  (none)

  Existing notes:
  (none)

  Existing timelines:
  (none)

project_context: >
  Project: Redesign Sito Web Corporate (id: proj-web-redesign).
  Always set projectId to this id on every record you create.

custom_prompt_section: |
  User instructions:
  Estrai i dati dai file come segue:
  - TASK: ogni azione da fare, deliverable, o item con scadenza.
    Mappa "URGENTE" o "ALTA PRIORITÀ" → priority: high.
    Mappa "media priorità" → priority: medium.
    Mappa "bassa priorità" → priority: low.
    Se un item è marcato come "completato" o [x], impostalo status: done.
    Altrimenti status: todo.
  - NOTE: riassunti di meeting, decisioni prese, note tecniche.
    Il titolo deve essere descrittivo. Il content deve includere tutti i dettagli.
  - TIMELINE: date di scadenza, milestone, meeting futuri.
    Imposta sempre isAiSuggested=1.

# ── Seed records (pre-existing DB state) ─────────────────────────
seed_records:
  projects:
    - id: "proj-web-redesign"
      name: "Redesign Sito Web Corporate"
      status: "active"
  tasks: []
  notes: []
  timelines: []

# ── Expected extractions ─────────────────────────────────────────
expected:
  tasks:
    - title: "Sviluppo frontend React"
      priority: "high"
      status: "todo"
    - title: "Integrazione API backend"
      priority: "medium"
      status: "todo"
    - title: "Testing cross-browser e fix bug responsive"
      status: "todo"
    - title: "Preparare wireframe homepage"
      priority: "high"
      status: "todo"
    - title: "Setup progetto Next.js e configurare CI/CD"
      priority: "medium"
      status: "todo"
    - title: "Ricerca plugin Stripe per gestione abbonamenti"
      priority: "low"
      status: "todo"

  notes:
    - title: "Meeting Kickoff Progetto E-Commerce"

  timelines:
    - title: "MVP E-Commerce pronto"
    - title: "Meeting di revisione"
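
One way the `expected` block above might be scored against extracted records is sketched below. This is an illustration under stated assumptions, not the project's scorer: the function name score_records, matching on case-insensitive title, and checking only the fields that an expected entry lists are all hypothetical. Returning None instead of 0 when no fields were compared mirrors the commit note about skipping a misleading field_accuracy=0.

```python
from typing import Dict, List, Optional


def score_records(expected: List[dict], actual: List[dict]) -> Dict[str, Optional[float]]:
    """Compare expected records with extracted ones (hypothetical scorer).

    Records are matched on case-insensitive title. Only fields present in
    the expected entry are checked, so an entry that omits `priority`
    accepts any priority.
    """
    by_title = {r["title"].strip().lower(): r for r in actual}
    matched = 0
    field_scores: List[float] = []

    for exp in expected:
        got = by_title.get(exp["title"].strip().lower())
        if got is None:
            continue  # missing record: lowers recall, adds no field scores
        matched += 1
        for field, want in exp.items():
            if field == "title":
                continue
            field_scores.append(1.0 if got.get(field) == want else 0.0)

    recall = matched / len(expected) if expected else None
    # No compared fields means "not applicable", not "0% accurate".
    field_accuracy = sum(field_scores) / len(field_scores) if field_scores else None
    return {"recall": recall, "field_accuracy": field_accuracy}
```

Under these assumptions, title-only entries such as the `notes` and `timelines` items contribute to recall but leave field_accuracy undefined rather than dragging it to zero.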