- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
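The mode-based dispatch described above could look roughly like the following. This is a minimal sketch, not the actual runner: only the names EvalMode, _run_step1, _run_step2 and _run_full come from the change list, while the `run_eval` entry point, the handler signatures and the returned dictionaries are assumptions made for illustration.

```python
from enum import Enum
from typing import Any, Callable, Dict


class EvalMode(str, Enum):
    """The three modes named in the change list."""
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction only
    FULL = "full"    # classification followed by extraction


# Placeholder handlers: the real ones would call the agents.
# Their signatures and return shapes are hypothetical.
def _run_step1(fixture: Dict[str, Any]) -> Dict[str, Any]:
    return {"mode": "step1", "classifications": []}


def _run_step2(fixture: Dict[str, Any]) -> Dict[str, Any]:
    return {"mode": "step2", "extraction": {}}


def _run_full(fixture: Dict[str, Any]) -> Dict[str, Any]:
    step1 = _run_step1(fixture)
    step2 = _run_step2(fixture)
    return {
        "mode": "full",
        "classifications": step1["classifications"],
        "extraction": step2["extraction"],
    }


_HANDLERS: Dict[EvalMode, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


def run_eval(mode: str, fixture: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical entry point: parse the --mode value and dispatch."""
    # EvalMode(mode) raises ValueError for an unknown mode string.
    return _HANDLERS[EvalMode(mode)](fixture)
```

A table of handlers keyed by the enum keeps the CLI flag, the config value and the runner in agreement: an unknown mode fails at parse time instead of falling through silently.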
Batch Agent Service
Owns: agent_runner, journey builder, filesystem_agent, integrations (Gmail, MS Graph).
Tables owned
- local_agent_configs
- cloud_agent_configs
- agent_run_logs
Endpoints
- GET /agents/catalog
- POST /agents/can-create
- POST /agents/trigger
- GET /agents/{id}/history
Redis channels
- Subscribe: batch:request:{user_id}
- Publish: ws:out:{user_id} (journey replies + tool calls)
- BRPOP: tool:result:{call_id} (30s timeout)
- SET+EX: journey:{user_id} (session state, TTL 1800s)
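The channel and key conventions above can be sketched as a handful of helpers. This is an illustration under stated assumptions, not the service's actual code: the function names are hypothetical, payloads are assumed to be JSON, and `client` is assumed to be a redis-py style client exposing `publish`, `brpop` and `set`.

```python
import json
from typing import Any, Optional

TOOL_RESULT_TIMEOUT_S = 30  # BRPOP timeout listed above
JOURNEY_TTL_S = 1800        # session state TTL listed above


def batch_request_channel(user_id: str) -> str:
    return f"batch:request:{user_id}"


def ws_out_channel(user_id: str) -> str:
    return f"ws:out:{user_id}"


def tool_result_key(call_id: str) -> str:
    return f"tool:result:{call_id}"


def journey_key(user_id: str) -> str:
    return f"journey:{user_id}"


def publish_reply(client: Any, user_id: str, payload: dict) -> None:
    """Publish a journey reply or tool call to the user's outbound channel."""
    # JSON encoding is an assumption; the real wire format is not documented here.
    client.publish(ws_out_channel(user_id), json.dumps(payload))


def wait_for_tool_result(client: Any, call_id: str) -> Optional[dict]:
    """Block for up to 30s on the result list; return None on timeout."""
    item = client.brpop(tool_result_key(call_id), timeout=TOOL_RESULT_TIMEOUT_S)
    if item is None:
        return None
    _key, raw = item  # brpop returns a (key, value) pair
    return json.loads(raw)


def save_journey(client: Any, user_id: str, state: dict) -> None:
    """Store session state with a 1800s expiry (SET with EX)."""
    client.set(journey_key(user_id), json.dumps(state), ex=JOURNEY_TTL_S)
```

Keeping key construction in one place avoids typos in channel names spread across the subscriber, the publisher and the tool-result waiter.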
TODO
- Integrate Langfuse tracing (reuse the services/chat/app/tracing.py pattern: trace_span(), get_langfuse_callback(), prompt management). Each batch agent run should create a trace with input/output, link prompts, and pass the LangChain CallbackHandler to LLM calls.