feat(batch-agent): add E2E evaluation harness with Langfuse integration

- eval/mock_executor.py: intercepts execute_on_client, serves fixture files from disk, records all mutations (insert/update/delete)
- eval/config.py: YAML fixture loader with prompt variants, expected results, seed records, model overrides
- eval/scorer.py: FieldMatchScorer (fuzzy title match, per-field accuracy, precision/recall/F1) + LLMJudgeScorer (semantic eval)
- eval/langfuse_eval.py: sync fixtures to Langfuse datasets, create dataset runs, post scores, link traces to runs
- eval/runner.py: orchestrates fixture → mock → agent pipeline → scoring → Langfuse reporting
- eval/cli.py: CLI (python -m eval run/list/sync) with --models, --variants, --fixture, --no-judge flags
- eval/fixtures/: example Italian freelance scenario with 3 prompt variants (baseline, detailed_italian, minimal)
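
The fuzzy title match and precision/recall/F1 mentioned for FieldMatchScorer could look roughly like the sketch below. It is not the code in eval/scorer.py: the function names, the greedy pairing strategy, the use of `difflib`, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (hypothetical helper)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_records(expected, produced, threshold=0.8):
    """Greedily pair each expected record with its best unused produced record.

    The threshold is an assumed value, not taken from the commit.
    """
    pairs = []
    used = set()
    for exp in expected:
        best_idx, best_score = None, threshold
        for idx, got in enumerate(produced):
            if idx in used:
                continue
            score = title_similarity(exp["title"], got.get("title", ""))
            if score >= best_score:
                best_idx, best_score = idx, score
        if best_idx is not None:
            used.add(best_idx)
            pairs.append((exp, produced[best_idx]))
    return pairs


def precision_recall_f1(n_matched: int, n_produced: int, n_expected: int):
    """Standard definitions; empty denominators yield 0.0 instead of raising."""
    precision = n_matched / n_produced if n_produced else 0.0
    recall = n_matched / n_expected if n_expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


expected = [
    {"title": "Sviluppo frontend React"},
    {"title": "Integrazione API backend"},
]
produced = [
    {"title": "sviluppo frontend react"},
    {"title": "Completely unrelated"},
]
pairs = match_records(expected, produced)
scores = precision_recall_f1(len(pairs), len(produced), len(expected))
```

Greedy pairing keeps the sketch short. A real scorer might prefer an optimal assignment so that one near-duplicate title cannot take a better match away from another record.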
86
services/batch-agent/eval/fixtures/freelance_invoices.yaml
Normal file
@@ -0,0 +1,86 @@
# Fixture: freelance-invoices
# Tests extraction of tasks, notes, and timelines from
# invoices and meeting notes typical of a freelance workflow.

name: freelance-invoices
description: >
  Extract tasks, notes, and timeline events from Italian freelance
  invoices and meeting notes. Tests project matching, priority
  mapping, and bilingual content handling.

directory: sample_files/invoices
data_types: [tasks, notes, timelines]
file_extensions: [txt, md]

# Pre-existing records in the "database"
seed_records:
  projects:
    - id: "proj-web-redesign"
      name: "Redesign Sito Web Corporate"
      status: "active"
      aiSummary: "Corporate website redesign for Studio Architettura Bianchi"
    - id: "proj-ecommerce"
      name: "E-Commerce FashionStore"
      status: "active"
      aiSummary: "Next.js e-commerce platform for FashionStore srl"
  tasks: []
  notes: []
  timelines: []

# Prompt variations to compare
prompt_variants:
  baseline: |
    Extract action items as tasks and summaries as notes.
    For timelines, extract any mentioned dates and deadlines.
    Set isAiSuggested=1 on every record.

  detailed_italian: |
    Estrai i dati dai file come segue:
    - TASK: ogni azione da fare, deliverable, o item con scadenza.
      Mappa "URGENTE" o "ALTA PRIORITÀ" → priority: high.
      Mappa "media priorità" → priority: medium.
      Mappa "bassa priorità" → priority: low.
      Se un item è marcato come "completato" o [x], impostalo status: done.
      Altrimenti status: todo.
    - NOTE: riassunti di meeting, decisioni prese, note tecniche.
      Il titolo deve essere descrittivo. Il content deve includere tutti i dettagli.
    - TIMELINE: date di scadenza, milestone, meeting futuri.
      Formato data: timestamp Unix in millisecondi.
    Imposta sempre isAiSuggested=1.

  minimal: |
    Extract only high-priority action items as tasks.
    Ignore notes and timelines unless explicitly marked as important.
    Set isAiSuggested=1.

# Expected extractions (what the agent SHOULD produce)
# Only key fields are specified — scorer uses fuzzy matching
expected:
  tasks:
    - title: "Sviluppo frontend React"
      priority: "high"
      status: "todo"
    - title: "Integrazione API backend"
      priority: "medium"
      status: "todo"
    - title: "Testing cross-browser e fix bug responsive"
      status: "todo"
    - title: "Preparare wireframe homepage"
      priority: "high"
      status: "todo"
    - title: "Setup progetto Next.js e configurare CI/CD"
      priority: "medium"
      status: "todo"
    - title: "Ricerca plugin Stripe per gestione abbonamenti"
      priority: "low"
      status: "todo"

  notes:
    - title: "Meeting Kickoff Progetto E-Commerce"

  timelines:
    - title: "MVP E-Commerce pronto"
    - title: "Meeting di revisione"

# Models to test (can be overridden via CLI --models)
models: []
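
Because the `expected` block lists only key fields (the third task, for example, has no `priority`), per-field accuracy presumably counts only the fields a fixture entry actually specifies. The sketch below illustrates that idea. `field_matches` and `per_field_accuracy` are hypothetical names, and skipping `title` (on the assumption that it is handled separately by fuzzy matching) is also an assumption, not the behaviour of eval/scorer.py.

```python
def field_matches(expected: dict, produced: dict, skip=("title",)) -> dict:
    """Compare only fields present in the expected record.

    Extra fields on the produced record (e.g. isAiSuggested) are ignored.
    """
    return {
        field: produced.get(field) == want
        for field, want in expected.items()
        if field not in skip
    }


def per_field_accuracy(pairs) -> dict:
    """Aggregate exact-match accuracy per field over (expected, produced) pairs."""
    hits, totals = {}, {}
    for expected, produced in pairs:
        for field, ok in field_matches(expected, produced).items():
            totals[field] = totals.get(field, 0) + 1
            hits[field] = hits.get(field, 0) + int(ok)
    return {field: hits[field] / totals[field] for field in totals}


# Two already-matched pairs; the produced records are made-up agent output.
pairs = [
    (
        {"title": "Sviluppo frontend React", "priority": "high", "status": "todo"},
        {"title": "Sviluppo frontend React", "priority": "medium",
         "status": "todo", "isAiSuggested": 1},
    ),
    (
        {"title": "Testing cross-browser e fix bug responsive", "status": "todo"},
        {"title": "Testing cross-browser e fix bug responsive",
         "priority": "high", "status": "todo"},
    ),
]
accuracy = per_field_accuracy(pairs)
```

In this example the second produced record carries a `priority` that the fixture never specified, so it neither helps nor hurts the `priority` score; only the first pair contributes to it.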