Files

Roberto Musso d3f7099d93 refactor(eval): 3-mode eval harness (step1/step2/full) with Langfuse fixes

- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
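
The mode dispatch described in the bullets above could look roughly like the sketch below. Only the names EvalMode, _run_step1, _run_step2 and _run_full come from the commit message. The signatures, the return values, the _DISPATCH table and the run_eval() wrapper are assumptions made for illustration, since the runner itself is not shown here.

```python
from enum import Enum
from typing import Callable, Dict


class EvalMode(str, Enum):
    """The three modes named in the commit message."""
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction only
    FULL = "full"    # classification followed by extraction


# Placeholder step functions: the real runner's signatures and results are
# not shown in the source, so these only report which stages would run.
def _run_step1(fixture: str) -> dict:
    return {"fixture": fixture, "steps": ["classify"]}


def _run_step2(fixture: str) -> dict:
    return {"fixture": fixture, "steps": ["extract"]}


def _run_full(fixture: str) -> dict:
    return {"fixture": fixture, "steps": ["classify", "extract"]}


# Hypothetical lookup table from mode to handler.
_DISPATCH: Dict[EvalMode, Callable[[str], dict]] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


def run_eval(mode: str, fixture: str) -> dict:
    """Map a --mode string to its handler; an unknown mode raises ValueError."""
    return _DISPATCH[EvalMode(mode)](fixture)
```

With this shape, run_eval("full", "full_invoices") runs both stages, and a value such as "variants" is rejected when EvalMode(mode) is constructed.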

2026-03-24 16:18:51 +01:00

app               refactor: deduplicate shared code into shared/ module                       2026-03-23 23:01:45 +01:00
eval              refactor(eval): 3-mode eval harness (step1/step2/full) with Langfuse fixes  2026-03-24 16:18:51 +01:00
Dockerfile        feat(batch-agent): extract Batch Agent Service (Step 3)                     2026-03-23 07:19:02 +01:00
README.md         docs: add Langfuse integration TODO for batch-agent service                 2026-03-23 00:25:42 +01:00
requirements.txt  feat(batch-agent): integrate Langfuse tracing                               2026-03-23 08:43:15 +01:00

README.md

Batch Agent Service

Owns: agent_runner, journey builder, filesystem_agent, integrations (Gmail, MS Graph).

Tables owned

local_agent_configs
cloud_agent_configs
agent_run_logs

Endpoints

GET /agents/catalog
POST /agents/can-create
POST /agents/trigger
GET /agents/{id}/history

Redis channels

Subscribe: batch:request:{user_id}
Publish: ws:out:{user_id} (journey replies + tool calls)
BRPOP: tool:result:{call_id} (30s timeout)
SET+EX: journey:{user_id} (session state, TTL 1800s)
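
As a rough illustration of the key patterns above, the helpers below build each key and call the matching Redis command. They are a sketch, not the service's code: the function names, the JSON payload encoding and the duck-typed client parameter are assumptions. The client is expected to behave like a redis-py client, using its publish(), brpop() and set(..., ex=...) methods.

```python
import json
from typing import Any, Optional

TOOL_RESULT_TIMEOUT_S = 30  # BRPOP timeout listed above
JOURNEY_TTL_S = 1800        # session-state TTL listed above


def request_channel(user_id: str) -> str:
    """Channel the service subscribes to for a given user."""
    return f"batch:request:{user_id}"


def send_to_client(client: Any, user_id: str, payload: dict) -> None:
    """Publish a journey reply or tool call on ws:out:{user_id}."""
    client.publish(f"ws:out:{user_id}", json.dumps(payload))


def wait_for_tool_result(client: Any, call_id: str) -> Optional[dict]:
    """Block on tool:result:{call_id}; return None if nothing arrives in time."""
    item = client.brpop(f"tool:result:{call_id}", timeout=TOOL_RESULT_TIMEOUT_S)
    if item is None:
        return None
    _key, raw = item  # redis-py returns a (key, value) pair
    return json.loads(raw)


def save_journey(client: Any, user_id: str, state: dict) -> None:
    """Store session state with SET + EX so it expires after the TTL."""
    client.set(f"journey:{user_id}", json.dumps(state), ex=JOURNEY_TTL_S)
```

Passing the client in keeps the helpers testable without a running Redis server. In the service, the subscribe side would use the client's pub/sub interface on request_channel(user_id).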

TODO

Integrate Langfuse tracing by reusing the pattern in services/chat/app/tracing.py: trace_span(), get_langfuse_callback(), and prompt management. Each batch agent run should create a trace with its input and output, link the prompts it used, and pass the LangChain CallbackHandler to its LLM calls.
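
The intended shape of a traced run might look like the sketch below. The real trace_span() and get_langfuse_callback() live in services/chat/app/tracing.py and their signatures are not shown here, so both are replaced with local stand-ins that only record what a trace would contain. The run_batch_agent() function, the llm_call parameter and the RECORDED_SPANS list are likewise hypothetical.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List, Optional

# Stand-in sink; the real helper would send spans to Langfuse instead.
RECORDED_SPANS: List[Dict[str, Any]] = []


@contextmanager
def trace_span(
    name: str, input_data: Any, prompts: Optional[List[str]] = None
) -> Iterator[Dict[str, Any]]:
    """Stand-in for the real helper: collects input, output and linked prompts."""
    span = {
        "name": name,
        "input": input_data,
        "prompts": list(prompts or []),
        "output": None,
    }
    try:
        yield span
    finally:
        RECORDED_SPANS.append(span)


def get_langfuse_callback() -> Optional[object]:
    """Stand-in: the real helper is expected to return a LangChain CallbackHandler."""
    return None


def run_batch_agent(task: str, llm_call: Callable[..., str]) -> str:
    """One trace per run: record input and output, link prompts, pass callbacks."""
    callback = get_langfuse_callback()
    callbacks = [callback] if callback is not None else []
    with trace_span("batch_agent_run", task, prompts=["batch_processing"]) as span:
        output = llm_call(task, callbacks=callbacks)
        span["output"] = output
    return output
```

Keeping the callback optional lets the agent run unchanged when tracing is not configured, which is one plausible way to keep Langfuse out of the critical path.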