- eval/mock_executor.py: intercepts execute_on_client, serves fixture files from disk, records all mutations (insert/update/delete)
- eval/config.py: YAML fixture loader with prompt variants, expected results, seed records, model overrides
- eval/scorer.py: FieldMatchScorer (fuzzy title match, per-field accuracy, precision/recall/F1) + LLMJudgeScorer (semantic eval)
- eval/langfuse_eval.py: sync fixtures to Langfuse datasets, create dataset runs, post scores, link traces to runs
- eval/runner.py: orchestrates fixture → mock → agent pipeline → scoring → Langfuse reporting
- eval/cli.py: CLI (python -m eval run/list/sync) with --models, --variants, --fixture, --no-judge flags
- eval/fixtures/: example Italian freelance scenario with 3 prompt variants (baseline, detailed_italian, minimal)
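As a rough illustration of the field-match metrics named for eval/scorer.py, the sketch below pairs expected and actual records by fuzzy title and then computes precision, recall, F1, and per-field accuracy. It is not the actual FieldMatchScorer: the function names, the 0.8 threshold, the greedy pairing, and the use of difflib as the fuzzy matcher are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def titles_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy title comparison; the 0.8 threshold is an assumed value."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def score_records(expected: list[dict], actual: list[dict]) -> dict:
    """Greedily pair expected and actual records by fuzzy title, then score."""
    unmatched = list(actual)
    pairs = []
    for exp in expected:
        for act in unmatched:
            if titles_match(exp["title"], act.get("title", "")):
                pairs.append((exp, act))
                unmatched.remove(act)  # each actual record is matched at most once
                break

    precision = len(pairs) / len(actual) if actual else 0.0
    recall = len(pairs) / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    # Per-field accuracy: share of non-title expected fields that the
    # matched actual record reproduces exactly.
    checked = correct = 0
    for exp, act in pairs:
        for field, value in exp.items():
            if field == "title":
                continue
            checked += 1
            correct += int(act.get(field) == value)
    field_accuracy = correct / checked if checked else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "field_accuracy": field_accuracy,
    }
```

Here precision is the share of records the agent produced that match an expected record, and recall is the share of expected records that were found. Field accuracy is computed only over matched pairs, so a missing record lowers recall without also lowering field accuracy.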
Batch Agent Service
Owns: agent_runner, journey builder, filesystem_agent, integrations (Gmail, MS Graph).
Tables owned
- local_agent_configs
- cloud_agent_configs
- agent_run_logs
Endpoints
- GET /agents/catalog
- POST /agents/can-create
- POST /agents/trigger
- GET /agents/{id}/history
Redis channels
- Subscribe: batch:request:{user_id}
- Publish: ws:out:{user_id} (journey replies + tool calls)
- BRPOP: tool:result:{call_id} (30s timeout)
- SET+EX: journey:{user_id} (session state, TTL 1800s)
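The channel and key layout above could be wrapped in a few helpers, sketched below. The key patterns, the 30 s timeout, and the 1800 s TTL come from the list. The helper names, the JSON encoding of payloads, and the assumption that the other side delivers tool results with LPUSH are illustrative guesses. `FakeRedis` is a hypothetical in-memory stand-in covering only the calls used here, so the sketch runs without a server; a redis-py client exposes `publish`, `brpop`, and `set(..., ex=...)` with compatible arguments.

```python
import json
from collections import defaultdict, deque

TOOL_RESULT_TIMEOUT_S = 30  # BRPOP timeout from the list above
JOURNEY_TTL_S = 1800        # journey session TTL from the list above


def request_channel(user_id: str) -> str:
    return f"batch:request:{user_id}"


def outbound_channel(user_id: str) -> str:
    return f"ws:out:{user_id}"


def tool_result_key(call_id: str) -> str:
    return f"tool:result:{call_id}"


def journey_key(user_id: str) -> str:
    return f"journey:{user_id}"


def publish_reply(client, user_id: str, payload: dict) -> None:
    """Publish a journey reply or tool call to the user's outbound channel."""
    client.publish(outbound_channel(user_id), json.dumps(payload))


def wait_for_tool_result(client, call_id: str):
    """Block on the result list; return the decoded result or None on timeout."""
    item = client.brpop(tool_result_key(call_id), timeout=TOOL_RESULT_TIMEOUT_S)
    if item is None:
        return None
    _key, raw = item
    return json.loads(raw)


def save_journey(client, user_id: str, state: dict) -> None:
    """Store session state with the TTL, as SET with EX would."""
    client.set(journey_key(user_id), json.dumps(state), ex=JOURNEY_TTL_S)


class FakeRedis:
    """In-memory stand-in for the calls used above (not real Redis)."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.values = {}
        self.published = []

    def publish(self, channel, message):
        self.published.append((channel, message))

    def lpush(self, key, value):
        self.lists[key].appendleft(value)

    def brpop(self, key, timeout=0):
        # Real BRPOP blocks for up to `timeout` seconds; this stub returns at once.
        if self.lists[key]:
            return key, self.lists[key].pop()
        return None

    def set(self, key, value, ex=None):
        self.values[key] = (value, ex)
```

Keeping the key patterns in one place means the timeout and TTL are changed in a single spot, and callers can treat a `None` from `wait_for_tool_result` as a timed-out tool call.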
TODO
- Integrate Langfuse tracing (reuse the services/chat/app/tracing.py pattern: trace_span(), get_langfuse_callback(), prompt management). Each batch agent run should create a trace with input/output, link prompts, and pass the LangChain CallbackHandler to LLM calls.