- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
- tracing.py: add compile_prompt() that uses Langfuse .compile(**vars)
for {{variable}} substitution and falls back to Python .format() for
hardcoded {variable} templates
- agent_runner.py: replace _get_system_prompt().format() with
tracing.compile_prompt() for batch_file_classifier, batch_processing,
batch_cloud_processing prompts
- journey.py: replace get_prompt + .format() with compile_prompt()
for journey_system prompt
- chat tracing.py: add compile_prompt() for parity (chat prompts
currently have no variables, but this keeps them ready for future use)
- Remove unused _get_system_prompt helper
- Add _get_system_prompt helper: fetches managed prompts from Langfuse,
with a hardcoded fallback (same pattern as chat service)
- journey.py: make journey_system prompt manageable via Langfuse
- agent_runner.py: make batch_file_classifier, batch_processing,
batch_cloud_processing prompts all manageable via Langfuse
- redis_consumer.py: call link_prompt_to_trace in all three handlers
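
For reviewers, a minimal sketch of the compile_prompt() fallback behaviour described above. It is not the code in tracing.py: the signature, the `fallback` parameter, and the `FakeManagedPrompt` stand-in are assumptions made for illustration. The only thing taken from the change itself is that a managed prompt is rendered with `.compile(**vars)` for `{{variable}}` placeholders, and a hardcoded template with `.format()` for `{variable}` placeholders.

```python
import re
from typing import Any, Optional


class FakeManagedPrompt:
    """Hypothetical stand-in for a Langfuse prompt object (illustration only)."""

    def __init__(self, template: str) -> None:
        self.template = template

    def compile(self, **variables: Any) -> str:
        # Substitute {{name}} placeholders; leave unknown names untouched.
        def replace(match: "re.Match[str]") -> str:
            key = match.group(1)
            return str(variables[key]) if key in variables else match.group(0)

        return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, self.template)


def compile_prompt(prompt: Optional[Any], fallback: str, **variables: Any) -> str:
    """Render a managed prompt if one is available, else the hardcoded template.

    Assumed signature: `prompt` is any object exposing compile(**vars),
    `fallback` is a Python format string using single-brace {variable} syntax.
    """
    if prompt is not None:
        # Managed prompt: double-brace {{variable}} substitution.
        return prompt.compile(**variables)
    # No managed prompt: single-brace {variable} substitution.
    return fallback.format(**variables)


managed = FakeManagedPrompt("Classify {{count}} files for {{user}}")
hardcoded = "Classify {count} files for {user}"

from_managed = compile_prompt(managed, hardcoded, count=3, user="demo")
from_fallback = compile_prompt(None, hardcoded, count=3, user="demo")
```

Both calls are meant to produce the same rendered text, which is why call sites can switch to compile_prompt() without caring which template syntax is in play.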