Roberto Musso
|
63fa119543
|
feat(batch-agent): add journey eval to E2E harness
- journey_runner.py: orchestrates journey start → simulated user
messages → template extraction → LLM judge scoring
- config.py: JourneyFixture dataclass with user_messages and
expected_template_criteria, discover_journey_fixtures()
- langfuse_eval.py: sync_journey_fixture_to_dataset()
- cli.py: new 'journey' subcommand (python -m eval journey)
with --fixture, --models, --judge-model flags
- fixtures/journey_invoice_setup.yaml: example journey fixture
with 4 user messages and 8 quality criteria
|
2026-03-23 23:16:41 +01:00 |
|
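As a rough illustration of the fixture shape that commit 63fa119543 describes, a `JourneyFixture` might be sketched as below. Only the names `JourneyFixture`, `user_messages`, `expected_template_criteria`, and `discover_journey_fixtures()` come from the commit message. The `name` field, the field types, the `fixtures_dir` parameter, and the `journey_*.yaml` naming pattern (inferred from `journey_invoice_setup.yaml`) are assumptions, and YAML parsing is left out.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class JourneyFixture:
    # `name` is a hypothetical identifier, not mentioned in the commit.
    name: str
    # Simulated user turns replayed against the journey, in order.
    user_messages: list[str] = field(default_factory=list)
    # Quality criteria the extracted template is judged against.
    expected_template_criteria: list[str] = field(default_factory=list)


def discover_journey_fixtures(fixtures_dir: Path) -> list[Path]:
    # Assumption: journey fixtures are the files named journey_*.yaml.
    # Sorting keeps the discovery order deterministic across runs.
    return sorted(Path(fixtures_dir).glob("journey_*.yaml"))
```

The real loader presumably parses each YAML file into a `JourneyFixture`; this sketch only locates the files.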
Roberto Musso
|
75a826c9d8
|
feat(batch-agent): add E2E evaluation harness with Langfuse integration
- eval/mock_executor.py: intercepts execute_on_client, serves fixture
files from disk, records all mutations (insert/update/delete)
- eval/config.py: YAML fixture loader with prompt variants, expected
results, seed records, model overrides
- eval/scorer.py: FieldMatchScorer (fuzzy title match, per-field
accuracy, precision/recall/F1) + LLMJudgeScorer (semantic eval)
- eval/langfuse_eval.py: sync fixtures to Langfuse datasets, create
dataset runs, post scores, link traces to runs
- eval/runner.py: orchestrates fixture → mock → agent pipeline →
scoring → Langfuse reporting
- eval/cli.py: CLI (python -m eval run/list/sync) with --models,
--variants, --fixture, --no-judge flags
- eval/fixtures/: example Italian freelance scenario with 3 prompt
variants (baseline, detailed_italian, minimal)
|
2026-03-23 08:54:19 +01:00 |
|
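The scoring that commit 75a826c9d8 attributes to `FieldMatchScorer` (fuzzy title match, then precision/recall/F1) can be sketched as follows. This is not the code in `eval/scorer.py`: the function names, the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the greedy one-to-one matching are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def titles_match(a: str, b: str, threshold: float = 0.8) -> bool:
    # Hypothetical fuzzy match: case-insensitive similarity ratio.
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def precision_recall_f1(expected: list[str], produced: list[str],
                        threshold: float = 0.8) -> tuple[float, float, float]:
    # Greedy one-to-one matching: each expected title is consumed at most once.
    unmatched = list(expected)
    true_positives = 0
    for title in produced:
        for candidate in unmatched:
            if titles_match(title, candidate, threshold):
                unmatched.remove(candidate)
                true_positives += 1
                break
    precision = true_positives / len(produced) if produced else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with expected titles `["Invoice March", "Tax payment"]` and produced titles `["invoice march", "Buy milk"]`, one title matches, so precision, recall, and F1 all come out at 0.5.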