refactor(eval): 3-mode eval harness (step1/step2/full) with Langfuse fixes

- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
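
For orientation, the mode dispatch and the empty-field_scores guard described above could look roughly like the sketch below. Only the names EvalMode, _run_step1, _run_step2, _run_full, field_scores, field_accuracy and step1_results come from this commit message. The fixture dict shape, the run_eval and scores_to_post helpers, and all function bodies are hypothetical placeholders, not the code in this commit.

```python
import asyncio
from enum import Enum
from typing import Any, Awaitable, Callable, Dict


class EvalMode(str, Enum):
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction only
    FULL = "full"    # classification followed by extraction


# Placeholder step runners: the real ones would call the agent under test.
async def _run_step1(fixture: Dict[str, Any]) -> Dict[str, Any]:
    return {"mode": "step1", "classifications": fixture.get("files", [])}


async def _run_step2(fixture: Dict[str, Any]) -> Dict[str, Any]:
    return {"mode": "step2", "extraction": fixture.get("expected", {})}


async def _run_full(fixture: Dict[str, Any]) -> Dict[str, Any]:
    # Full mode chains both steps and keeps the step1 output in the result.
    step1 = await _run_step1(fixture)
    step2 = await _run_step2(fixture)
    return {
        "mode": "full",
        "step1_results": step1,
        "extraction": step2["extraction"],
    }


Runner = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]

_DISPATCH: Dict[EvalMode, Runner] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


async def run_eval(mode: str, fixture: Dict[str, Any]) -> Dict[str, Any]:
    # EvalMode(mode) raises ValueError for an unknown mode string.
    return await _DISPATCH[EvalMode(mode)](fixture)


def scores_to_post(field_scores: Dict[str, float]) -> Dict[str, float]:
    # With no field scores (e.g. step1), post nothing instead of a
    # misleading field_accuracy of 0.
    if not field_scores:
        return {}
    accuracy = sum(field_scores.values()) / len(field_scores)
    return {"field_accuracy": accuracy}


if __name__ == "__main__":
    print(asyncio.run(run_eval("full", {"files": ["a.pdf"], "expected": {}})))
```

Keeping the runners in a dict keyed by the enum means an unsupported --mode value fails at EvalMode(mode) before any runner is invoked.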

Roberto Musso

2026-03-24 16:18:51 +01:00

parent 63fa119543

commit d3f7099d93

13 changed files with 1409 additions and 439 deletions

									
services/batch-agent/eval/scorer.py
									
@@ -242,7 +242,7 @@ async def llm_judge_score(
     Returns (score, reasoning).
     """
-    from app.llm import get_llm
+    from shared.llm import get_llm
     llm = get_llm(model=judge_model, temperature=0)
