feat(batch-agent): add E2E evaluation harness with Langfuse integration

- eval/mock_executor.py: intercepts execute_on_client, serves fixture
  files from disk, records all mutations (insert/update/delete)
- eval/config.py: YAML fixture loader with prompt variants, expected
  results, seed records, model overrides
- eval/scorer.py: FieldMatchScorer (fuzzy title match, per-field
  accuracy, precision/recall/F1) + LLMJudgeScorer (semantic eval)
- eval/langfuse_eval.py: sync fixtures to Langfuse datasets, create
  dataset runs, post scores, link traces to runs
- eval/runner.py: orchestrates fixture → mock → agent pipeline →
  scoring → Langfuse reporting
- eval/cli.py: CLI (python -m eval run/list/sync) with --models,
  --variants, --fixture, --no-judge flags
- eval/fixtures/: example Italian freelance scenario with 3 prompt
  variants (baseline, detailed_italian, minimal)
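
For orientation, the record-level metrics named for eval/scorer.py (fuzzy title match, precision/recall/F1) can be sketched as below. This is not the code from the commit: the function names, the `title` key, the 0.8 threshold and the use of difflib are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def titles_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Fuzzy title comparison, ignoring case and repeated whitespace.

    Hypothetical helper: the similarity measure and threshold are assumptions.
    """
    a = " ".join(expected.lower().split())
    b = " ".join(actual.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


def score_records(expected: list, actual: list, threshold: float = 0.8) -> dict:
    """Greedily pair expected and actual records by title, then compute P/R/F1."""
    unmatched = list(actual)  # actual records not yet paired with an expected one
    true_positives = 0
    for exp in expected:
        for act in unmatched:
            if titles_match(exp["title"], act["title"], threshold):
                unmatched.remove(act)  # each actual record may be matched once
                true_positives += 1
                break
    precision = true_positives / len(actual) if actual else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Greedy first-match pairing keeps the sketch short; a real scorer might prefer best-match pairing so that one loose match cannot consume a record that a later expected item fits better.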

Roberto Musso

2026-03-23 08:54:19 +01:00

parent 971f1dd84f

commit 75a826c9d8

12 changed files with 1382 additions and 0 deletions

services/batch-agent/eval/__main__.py (Normal file, 5 lines)

@@ -0,0 +1,5 @@
"""Allow running the eval package as ``python -m eval``."""

from eval.cli import main

main()
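
The hunk above only delegates to `eval.cli.main`, which is not shown on this page. As a rough sketch of how the subcommands and flags listed in the commit message (run/list/sync, --models, --variants, --fixture, --no-judge) could be wired with argparse: the `build_parser` name, the assignment of every flag to `run`, and the argument types are assumptions, not the contents of eval/cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the subcommands named in the commit message."""
    parser = argparse.ArgumentParser(prog="python -m eval")
    sub = parser.add_subparsers(dest="command", required=True)

    # Assumption: all four flags belong to the `run` subcommand.
    run = sub.add_parser("run", help="run fixtures through the agent and score them")
    run.add_argument("--models", nargs="+", default=[], help="model overrides")
    run.add_argument("--variants", nargs="+", default=[], help="prompt variants to run")
    run.add_argument("--fixture", help="restrict the run to a single fixture")
    run.add_argument("--no-judge", action="store_true", help="skip the LLM judge scorer")

    sub.add_parser("list", help="list available fixtures")
    sub.add_parser("sync", help="sync fixtures to Langfuse datasets")
    return parser
```

Under these assumptions, `build_parser().parse_args(["run", "--variants", "baseline", "minimal", "--no-judge"])` would yield a namespace with `command="run"`, two variants, and `no_judge=True`.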
