8 Commits

Author SHA1 Message Date
Roberto Musso
fe0dd038ee fix: Langfuse SDK v4 migration, tracing improvements, and LLM config
- Langfuse SDK v4: fix prompt-to-trace linking (as_type=generation)
- tracing: compile_prompt with Langfuse managed prompt fallback
- journey: remove journey CLI subcommand (keep only interactive)
- LLM: add service-specific llm modules for batch-agent and chat
- gitignore: exclude eval private test data
- config: add LANGFUSE settings to shared config
2026-03-24 16:25:51 +01:00
Roberto Musso
d3f7099d93 refactor(eval): 3-mode eval harness (step1/step2/full) with Langfuse fixes
- Rewrite eval config with EvalMode (step1, step2, full) replacing prompt_variants
- Rewrite runner with _run_step1, _run_step2, _run_full dispatch
- CLI: replace --variants with --mode flag
- Add 3 fixture YAMLs: classify_invoices (step1), process_invoices (step2), full_invoices (full)
- Remove old freelance_invoices fixture
- Langfuse: mode-aware dataset items (classifications for step1, extraction for step2, both for full)
- Langfuse: link both prompts (batch_file_classifier + batch_processing) in full mode
- Langfuse: post separate classification_precision/recall/f1 scores for full mode
- Langfuse: skip misleading field_accuracy=0 when field_scores is empty (step1)
- Langfuse: include step1_results in trace output
- MockExecutor: mock async_session to bypass DB in full mode
- Journey fixture: remove user_messages (only interactive test kept)
2026-03-24 16:18:51 +01:00
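Note: the mode dispatch described in this commit could look roughly like the sketch below. Only `EvalMode`, the three mode names, and the `_run_step1` / `_run_step2` / `_run_full` names come from the commit message; the function bodies, the `run_eval` entry point, and its arguments are hypothetical placeholders, not the harness's actual code.

```python
from enum import Enum
from typing import Callable, Dict


class EvalMode(str, Enum):
    """The three modes named in the commit message."""
    STEP1 = "step1"  # classification only
    STEP2 = "step2"  # extraction/processing only
    FULL = "full"    # classification followed by processing


# Placeholder runners: the real ones would drive the agent pipeline.
def _run_step1(fixture: str) -> str:
    return f"step1:{fixture}"


def _run_step2(fixture: str) -> str:
    return f"step2:{fixture}"


def _run_full(fixture: str) -> str:
    return f"full:{fixture}"


_DISPATCH: Dict[EvalMode, Callable[[str], str]] = {
    EvalMode.STEP1: _run_step1,
    EvalMode.STEP2: _run_step2,
    EvalMode.FULL: _run_full,
}


def run_eval(mode: str, fixture: str) -> str:
    """Hypothetical entry point: validate the --mode value, then dispatch."""
    # EvalMode(mode) raises ValueError for anything other than the three modes.
    return _DISPATCH[EvalMode(mode)](fixture)
```

Parsing the flag through the enum means an unknown `--mode` value fails early with a `ValueError` instead of silently running the wrong path.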
Roberto Musso
63fa119543 feat(batch-agent): add journey eval to E2E harness
- journey_runner.py: orchestrates journey start → simulated user
  messages → template extraction → LLM judge scoring
- config.py: JourneyFixture dataclass with user_messages and
  expected_template_criteria, discover_journey_fixtures()
- langfuse_eval.py: sync_journey_fixture_to_dataset()
- cli.py: new 'journey' subcommand (python -m eval journey)
  with --fixture, --models, --judge-model flags
- fixtures/journey_invoice_setup.yaml: example journey fixture
  with 4 user messages and 8 quality criteria
2026-03-23 23:16:41 +01:00
Roberto Musso
d856dfd28c refactor: deduplicate shared code into shared/ module
Move duplicated files from chat + batch-agent into shared/:
- shared/ws_context.py — Redis-based tool call round-trip
- shared/llm.py — LiteLLM factory (get_llm, embed)
- shared/agents/ — 4 domain agents (task, note, project, timeline)

Update all service imports to use shared.* instead of app.*.
Delete 12 duplicated files across both services.
2026-03-23 23:01:45 +01:00
Roberto Musso
ccba54ac24 fix(tracing): use Langfuse compile_prompt with {{variable}} syntax
- tracing.py: add compile_prompt() that uses Langfuse .compile(**vars)
  for {{variable}} substitution, falls back to Python .format() for
  hardcoded {variable} templates
- agent_runner.py: replace _get_system_prompt().format() with
  tracing.compile_prompt() for batch_file_classifier, batch_processing,
  batch_cloud_processing prompts
- journey.py: replace get_prompt + .format() with compile_prompt()
  for journey_system prompt
- chat tracing.py: add compile_prompt() for parity (chat prompts
  currently have no variables, but ready for future use)
- Remove unused _get_system_prompt helper
2026-03-23 22:39:27 +01:00
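Note: a minimal sketch of the fallback behaviour this commit describes, under stated assumptions. The real `tracing.compile_prompt()` takes a prompt name and fetches it from Langfuse; here the fetched prompt is passed in directly as `managed_prompt`, and `FakeManagedPrompt` is a hypothetical stand-in that only mimics `{{variable}}` substitution so the example runs without a Langfuse client.

```python
from typing import Any, Mapping, Optional


def compile_prompt(
    managed_prompt: Optional[Any],
    fallback: str,
    variables: Mapping[str, str],
) -> str:
    """Prefer the managed prompt; otherwise format the hardcoded template.

    `managed_prompt` is assumed to expose .compile(**vars), which fills
    {{variable}} placeholders. The fallback uses Python {variable} syntax.
    """
    if managed_prompt is not None:
        try:
            return managed_prompt.compile(**variables)
        except Exception:
            pass  # assumption: any failure falls through to the fallback
    return fallback.format(**variables)


class FakeManagedPrompt:
    """Hypothetical stand-in for a managed prompt object (illustration only)."""

    def __init__(self, template: str) -> None:
        self.template = template

    def compile(self, **variables: str) -> str:
        text = self.template
        for name, value in variables.items():
            text = text.replace("{{" + name + "}}", value)
        return text
```

The two syntaxes differ on purpose: a managed template written with `{{directory}}` and a hardcoded one written with `{directory}` both resolve to the same text for the same variables.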
Roberto Musso
55500cc818 feat(batch-agent): add Langfuse prompt management
- _get_system_prompt helper: fetches managed prompts from Langfuse
  with hardcoded fallback (same pattern as chat service)
- journey.py: journey_system prompt manageable via Langfuse
- agent_runner.py: batch_file_classifier, batch_processing,
  batch_cloud_processing prompts all manageable via Langfuse
- redis_consumer.py: link_prompt_to_trace for all three handlers
2026-03-23 22:30:36 +01:00
Roberto Musso
75a826c9d8 feat(batch-agent): add E2E evaluation harness with Langfuse integration
- eval/mock_executor.py: intercepts execute_on_client, serves fixture
  files from disk, records all mutations (insert/update/delete)
- eval/config.py: YAML fixture loader with prompt variants, expected
  results, seed records, model overrides
- eval/scorer.py: FieldMatchScorer (fuzzy title match, per-field
  accuracy, precision/recall/F1) + LLMJudgeScorer (semantic eval)
- eval/langfuse_eval.py: sync fixtures to Langfuse datasets, create
  dataset runs, post scores, link traces to runs
- eval/runner.py: orchestrates fixture → mock → agent pipeline →
  scoring → Langfuse reporting
- eval/cli.py: CLI (python -m eval run/list/sync) with --models,
  --variants, --fixture, --no-judge flags
- eval/fixtures/: example Italian freelance scenario with 3 prompt
  variants (baseline, detailed_italian, minimal)
2026-03-23 08:54:19 +01:00
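Note: the precision/recall/F1 part of the scorer can be illustrated as below. This is a sketch, not `FieldMatchScorer` itself: it assumes records are compared by exact title, whereas the commit mentions fuzzy title matching, and the function name is hypothetical.

```python
from __future__ import annotations


def precision_recall_f1(
    expected: set[str], predicted: set[str]
) -> tuple[float, float, float]:
    """Score predicted record titles against the fixture's expected titles."""
    true_positives = len(expected & predicted)
    # Guard against empty sets so an empty run scores 0.0 instead of raising.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With three expected titles and three predicted titles of which two match, precision, recall, and F1 all come out to 2/3.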
Roberto Musso
971f1dd84f feat(batch-agent): integrate Langfuse tracing
- tracing.py: init/shutdown, trace_span, get_langfuse_callback, prompt mgmt
- main.py: init_langfuse at startup, shutdown on teardown
- redis_consumer.py: trace_span around journey_start/message/agent_trigger
- agent_runner.py: thread langfuse_handler through classify + processing LLM
- journey.py: thread langfuse_handler through _call_llm_with_tools
- llm.py: accept callbacks param, forward to LLM constructors
- requirements.txt: add langfuse>=3.0.0
2026-03-23 08:43:15 +01:00
47 changed files with 3658 additions and 737 deletions

3
.gitignore vendored

@@ -35,3 +35,6 @@ Thumbs.db
 # Claude Code
 .claude/
 logs/
+# Eval private test data
+services/batch-agent/eval/fixtures/private_data/


@@ -27,6 +27,7 @@ class Settings(BaseSettings):
     ANTHROPIC_API_KEY: str = ""
     GOOGLE_API_KEY: str = ""
     CEREBRAS_API_KEY: str = ""
+    GITHUB_TOKEN: str = ""
     LLM_MODEL: str = "gpt-4o"
     LLM_EMBED_MODEL: str = "text-embedding-3-small"


@@ -50,6 +50,8 @@ def _api_key_for_model(model: str) -> str | None:
         return settings.GOOGLE_API_KEY or None
     if model.startswith("cerebras/"):
         return settings.CEREBRAS_API_KEY or None
+    if model.startswith("github/"):
+        return settings.GITHUB_TOKEN or None
     if model.startswith("github_copilot/"):
         # GitHub Copilot uses OAuth device-flow tokens managed by LiteLLM.
         # No API key is required; returning None lets LiteLLM handle auth.
@@ -83,6 +85,9 @@ def get_llm(
     if settings.GITHUB_COPILOT_TOKEN_DIR:
         os.environ.setdefault("GITHUB_COPILOT_TOKEN_DIR", settings.GITHUB_COPILOT_TOKEN_DIR)
+    if settings.GITHUB_TOKEN:
+        os.environ.setdefault("GITHUB_TOKEN", settings.GITHUB_TOKEN)
     # Use ChatLiteLLM for provider-prefixed models (github_copilot/, anthropic/, etc.)
     # so LiteLLM handles routing and auth. ChatOpenAI for plain OpenAI model names.
     if "/" in model:
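Note: the prefix checks above depend on the trailing slash, so `github/` and `github_copilot/` never shadow each other. The sketch below shows this with a table-driven variant; the table, the `settings` mapping, and the final `return None` are assumptions for illustration, not the module's actual structure.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical prefix -> settings key table; only prefixes visible in the
# diff are listed here.
_PREFIX_TO_SETTING: Tuple[Tuple[str, str], ...] = (
    ("cerebras/", "CEREBRAS_API_KEY"),
    ("github/", "GITHUB_TOKEN"),
)


def api_key_for_model(model: str, settings: Mapping[str, str]) -> Optional[str]:
    """Return the API key for a provider-prefixed model name, or None."""
    if model.startswith("github_copilot/"):
        # Copilot auth is handled by LiteLLM's device flow; no key is needed.
        return None
    for prefix, setting_name in _PREFIX_TO_SETTING:
        if model.startswith(prefix):
            # An empty string means "not configured" and is reported as None.
            return settings.get(setting_name) or None
    return None  # assumption: unknown prefixes carry no key
```

Because `"github_copilot/x".startswith("github/")` is false, the order of the two GitHub checks does not affect the result.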


@@ -22,12 +22,13 @@ from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, Tool
 from sqlalchemy import select
 from app.agents.filesystem_agent import FILESYSTEM_TOOLS
-from app.agents.note_agent import NOTE_TOOLS
-from app.agents.project_agent import PROJECT_TOOLS
-from app.agents.task_agent import TASK_TOOLS
-from app.agents.timeline_agent import TIMELINE_TOOLS
-from app.llm import get_llm
-from app.ws_context import execute_on_client, set_current_user, clear_current_user
+from shared.agents.note_agent import NOTE_TOOLS
+from shared.agents.project_agent import PROJECT_TOOLS
+from shared.agents.task_agent import TASK_TOOLS
+from shared.agents.timeline_agent import TIMELINE_TOOLS
+from shared.llm import get_llm
+from shared.ws_context import execute_on_client, set_current_user, clear_current_user
+import app.tracing as tracing
 from shared.db import async_session
 from shared.models import AgentRunLog, CloudAgentConfig, LocalAgentConfig
 from shared.redis import redis_client, ws_out_channel
@@ -193,9 +194,11 @@ async def _run_agent_with_tools(
     user_message: str,
     tools: list[Any],
     max_steps: int,
+    langfuse_handler: Any | None = None,
 ) -> str:
     """Run an LLM agent with tool-calling, returning the final text response."""
-    llm = get_llm()
+    callbacks = [langfuse_handler] if langfuse_handler else None
+    llm = get_llm(callbacks=callbacks)
     llm_with_tools = llm.bind_tools(tools)
     messages: list[Any] = [
         SystemMessage(content=system_prompt),
@@ -396,6 +399,7 @@ async def _classify_file(
     file_content: str,
     projects: list[dict],
     config_data_types: list[str],
+    langfuse_handler: Any | None = None,
 ) -> tuple[str, list[str], str | None]:
     fallback: tuple[str, list[str], str | None] = ("new", list(config_data_types), None)
@@ -417,12 +421,16 @@ async def _classify_file(
         if d in _DOMAIN_DESCRIPTIONS
     )
-    system = _STEP1_SYSTEM_PROMPT.format(
-        domain_definitions=domain_definitions,
-        projects_list=projects_list,
+    system = tracing.compile_prompt(
+        "batch_file_classifier",
+        fallback=_STEP1_SYSTEM_PROMPT,
+        variables={
+            "domain_definitions": domain_definitions,
+            "projects_list": projects_list,
+        },
     )
-    llm = get_llm()
+    llm = get_llm(callbacks=[langfuse_handler] if langfuse_handler else None)
     try:
         response = await llm.ainvoke([
             SystemMessage(content=system),
@@ -458,7 +466,7 @@ async def _classify_file(
# ── Local agent runner (two-step per file) ──────────────────────────────── # ── Local agent runner (two-step per file) ────────────────────────────────
async def run_local_agent(user_id: str, trigger_data: dict[str, Any]) -> None: async def run_local_agent(user_id: str, trigger_data: dict[str, Any], *, langfuse_handler: Any | None = None) -> None:
"""Execute a local directory agent run. """Execute a local directory agent run.
In the microservice world, trigger_data is a serialized dict from In the microservice world, trigger_data is a serialized dict from
@@ -552,6 +560,7 @@ async def run_local_agent(user_id: str, trigger_data: dict[str, Any]) -> None:
file_content=file_content, file_content=file_content,
projects=projects, projects=projects,
config_data_types=data_types, config_data_types=data_types,
langfuse_handler=langfuse_handler,
) )
# Step 2 — resolve project_id, fetch entities, process # Step 2 — resolve project_id, fetch entities, process
@@ -593,11 +602,15 @@ async def run_local_agent(user_id: str, trigger_data: dict[str, Any]) -> None:
existing_context = "\n\n".join(existing_blocks) existing_context = "\n\n".join(existing_blocks)
system_prompt = _PROCESSING_SYSTEM_PROMPT.format( system_prompt = tracing.compile_prompt(
existing_context=existing_context, "batch_processing",
project_context=project_context, fallback=_PROCESSING_SYSTEM_PROMPT,
data_types=", ".join(domains), variables={
custom_prompt_section=custom_section, "existing_context": existing_context,
"project_context": project_context,
"data_types": ", ".join(domains),
"custom_prompt_section": custom_section,
},
) )
processing_tools = _build_processing_tools(domains) processing_tools = _build_processing_tools(domains)
@@ -610,6 +623,7 @@ async def run_local_agent(user_id: str, trigger_data: dict[str, Any]) -> None:
), ),
tools=processing_tools, tools=processing_tools,
max_steps=_MAX_PROCESSING_STEPS, max_steps=_MAX_PROCESSING_STEPS,
langfuse_handler=langfuse_handler,
) )
logger.info( logger.info(
"agent_runner: run=%s file=%r result=%s", "agent_runner: run=%s file=%r result=%s",
@@ -660,7 +674,7 @@ async def run_local_agent(user_id: str, trigger_data: dict[str, Any]) -> None:
_CLOUD_DEFAULT_LOOKBACK_DAYS: int = 7 _CLOUD_DEFAULT_LOOKBACK_DAYS: int = 7
async def run_cloud_agent(user_id: str, config_id: str) -> None: async def run_cloud_agent(user_id: str, config_id: str, *, langfuse_handler: Any | None = None) -> None:
"""Execute a cloud connector agent run. """Execute a cloud connector agent run.
Loads the CloudAgentConfig from DB, decrypts OAuth tokens, fetches Loads the CloudAgentConfig from DB, decrypts OAuth tokens, fetches
@@ -776,11 +790,15 @@ async def run_cloud_agent(user_id: str, config_id: str) -> None:
continue continue
items_processed += 1 items_processed += 1
processing_prompt = _CLOUD_PROCESSING_PROMPT.format( processing_prompt = tracing.compile_prompt(
data_types=", ".join(config.data_types), "batch_cloud_processing",
project_context="Determine the appropriate project from the message context.", fallback=_CLOUD_PROCESSING_PROMPT,
file_list=f"Message from {config.provider} (id: {msg.id})", variables={
custom_prompt_section=custom_section, "data_types": ", ".join(config.data_types),
"project_context": "Determine the appropriate project from the message context.",
"file_list": f"Message from {config.provider} (id: {msg.id})",
"custom_prompt_section": custom_section,
},
) )
try: try:
@@ -789,6 +807,7 @@ async def run_cloud_agent(user_id: str, config_id: str) -> None:
user_message=f"Process this message content:\n\n{content_text[:8000]}", user_message=f"Process this message content:\n\n{content_text[:8000]}",
tools=processing_tools, tools=processing_tools,
max_steps=_MAX_PROCESSING_STEPS, max_steps=_MAX_PROCESSING_STEPS,
langfuse_handler=langfuse_handler,
) )
except Exception as exc: except Exception as exc:
errors.append(f"LLM processing error for message {msg.id!r}: {exc}") errors.append(f"LLM processing error for message {msg.id!r}: {exc}")


@@ -9,7 +9,7 @@ from typing import Any
 from langchain_core.tools import tool
-from app.ws_context import execute_on_client
+from shared.ws_context import execute_on_client
 @tool


@@ -1,110 +0,0 @@
"""Note agent — Markdown note management.
Adapted for Batch Agent Service: import from app.ws_context and app.llm.
"""
from __future__ import annotations
import re
from typing import Any
from langchain_core.tools import tool
from app.llm import embed
from app.ws_context import execute_on_client
_UUID_RE = re.compile(
r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
)
def _is_uuid(value: str) -> bool:
return bool(_UUID_RE.match(value))
@tool
async def list_notes(project_id: str = "") -> str:
"""List notes, optionally scoped to a project by project_id."""
normalized_project_id = project_id if (project_id and _is_uuid(project_id)) else ""
result = await execute_on_client(
action="select",
table="notes",
filters={"projectId": normalized_project_id or None},
)
rows = result.get("rows", [])
if not rows:
return "No notes found."
lines = [f"- {r['title']} (id: {r['id']})" for r in rows]
return f"Found {len(rows)} note(s):\n" + "\n".join(lines)
@tool
async def get_note(note_id: str) -> str:
"""Fetch a single note by its UUID to read its full Markdown content."""
result = await execute_on_client(action="get", table="notes", data={"id": note_id})
row = result.get("row")
if not row:
return f"Note {note_id} not found."
return f"Note '{row['title']}' (id: {row['id']}):\n\n{row['content']}"
@tool
async def create_note(title: str, content: str, project_id: str = "") -> str:
"""Create a new note."""
result = await execute_on_client(
action="insert",
table="notes",
data={
"title": title,
"content": content,
"projectId": project_id or None,
},
)
row = result["row"]
vector = await embed(content)
await execute_on_client(
action="vector_upsert",
data={"id": row["id"], "projectId": row.get("projectId"), "content": content},
vector=vector,
)
return f"Note created: '{row['title']}' (id: {row['id']})."
@tool
async def update_note(note_id: str, title: str = "", content: str = "") -> str:
"""Update an existing note. Only pass fields that should change."""
updates: dict[str, Any] = {}
if title:
updates["title"] = title
if content:
updates["content"] = content
result = await execute_on_client(
action="update",
table="notes",
data={"id": note_id, "updates": updates},
)
row = result["row"]
if content:
vector = await embed(content)
await execute_on_client(
action="vector_upsert",
data={"id": note_id, "projectId": row.get("projectId"), "content": content},
vector=vector,
)
return f"Note updated: '{row['title']}' (id: {row['id']})."
@tool
async def delete_note(note_id: str) -> str:
"""Delete a note permanently by its UUID."""
await execute_on_client(action="delete", table="notes", data={"id": note_id})
return f"Note {note_id} deleted."
NOTE_TOOLS: list[Any] = [
list_notes,
get_note,
create_note,
update_note,
delete_note,
]


@@ -1,110 +0,0 @@
"""Project agent — full lifecycle management.
Adapted for Batch Agent Service: import from app.ws_context.
"""
from __future__ import annotations
from typing import Any
from langchain_core.tools import tool
from app.ws_context import execute_on_client
@tool
async def list_projects(client_id: str = "", include_archived: int = 0) -> str:
"""List projects, optionally filtered by client_id."""
result = await execute_on_client(
action="select",
table="projects",
filters={
"clientId": client_id or None,
"includeArchived": bool(include_archived),
},
)
rows = result.get("rows", [])
if not rows:
return "No projects found."
lines = [f"- {r['name']} (status: {r['status']}, id: {r['id']})" for r in rows]
return f"Found {len(rows)} project(s):\n" + "\n".join(lines)
@tool
async def list_all_projects() -> str:
"""List every project regardless of client or status."""
result = await execute_on_client(action="select", table="projects")
rows = result.get("rows", [])
if not rows:
return "No projects found."
lines = [f"- {r['name']} (status: {r['status']}, id: {r['id']})" for r in rows]
return f"All projects ({len(rows)}):\n" + "\n".join(lines)
@tool
async def get_project(project_id: str) -> str:
"""Fetch a single project by its UUID."""
result = await execute_on_client(action="get", table="projects", data={"id": project_id})
row = result.get("row")
if not row:
return f"Project {project_id} not found."
return (
f"Project: '{row['name']}' (id: {row['id']}, status: {row['status']}, "
f"clientId: {row.get('clientId', 'none')})"
)
@tool
async def create_project(name: str, client_id: str = "") -> str:
"""Create a new project."""
result = await execute_on_client(
action="insert",
table="projects",
data={"name": name, "clientId": client_id or None},
)
row = result["row"]
return f"Project created: '{row['name']}' (id: {row['id']})"
@tool
async def update_project(
project_id: str,
name: str = "",
client_id: str = "",
status: str = "",
ai_summary: str = "",
) -> str:
"""Update a project. Only pass fields that should change."""
updates: dict[str, Any] = {}
if name:
updates["name"] = name
if client_id:
updates["clientId"] = client_id
if status:
updates["status"] = status
if ai_summary:
updates["aiSummary"] = ai_summary
result = await execute_on_client(
action="update",
table="projects",
data={"id": project_id, "updates": updates},
)
row = result["row"]
return f"Project updated: '{row['name']}' (id: {row['id']}, status: {row['status']})"
@tool
async def delete_project(project_id: str) -> str:
"""Permanently delete a project."""
await execute_on_client(action="delete", table="projects", data={"id": project_id})
return f"Project {project_id} permanently deleted."
PROJECT_TOOLS: list[Any] = [
list_projects,
list_all_projects,
get_project,
create_project,
update_project,
delete_project,
]


@@ -1,197 +0,0 @@
"""Task agent — full CRUD for tasks and task comments.
Adapted for Batch Agent Service: import from app.ws_context.
"""
from __future__ import annotations
from datetime import datetime, timezone
import re
from typing import Any
from langchain_core.tools import tool
from app.ws_context import execute_on_client
_UUID_RE = re.compile(
r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
)
def _is_uuid(value: str) -> bool:
return bool(_UUID_RE.match(value))
@tool
async def list_tasks(
project_id: str = "",
status: str = "",
search: str = "",
order_by: str = "",
) -> str:
"""List tasks, optionally filtered by project_id, status, search, or order_by."""
normalized_project_id = project_id if (project_id and _is_uuid(project_id)) else ""
result = await execute_on_client(
action="select",
table="tasks",
filters={
"projectId": normalized_project_id or None,
"status": status or None,
"search": search or None,
"orderBy": order_by or None,
},
)
rows = result.get("rows", [])
if not rows:
return "No tasks found matching the given filters."
lines = [
f"- {r['title']} (status: {r['status']}, priority: {r['priority']}, id: {r['id']})"
for r in rows
]
return f"Found {len(rows)} task(s):\n" + "\n".join(lines)
@tool
async def create_task(
title: str,
description: str = "",
status: str = "todo",
priority: str = "medium",
assignees: str = "[]",
due_date: int = 0,
project_id: str = "",
is_ai_suggested: int = 0,
) -> str:
"""Create a new task."""
result = await execute_on_client(
action="insert",
table="tasks",
data={
"title": title,
"description": description or None,
"status": status,
"priority": priority,
"assignee": assignees,
"dueDate": due_date or None,
"projectId": project_id or None,
"isAiSuggested": is_ai_suggested,
},
)
row = result["row"]
return (
f"Task created: '{row['title']}' "
f"(id: {row['id']}, status: {row['status']}, priority: {row['priority']})"
)
@tool
async def update_task(
task_id: str,
title: str = "",
description: str = "",
status: str = "",
priority: str = "",
assignees: str = "",
due_date: int = -1,
project_id: str = "",
) -> str:
"""Update fields on an existing task. Only pass fields you want to change."""
updates: dict[str, Any] = {}
if title:
updates["title"] = title
if description:
updates["description"] = description
if status:
updates["status"] = status
if priority:
updates["priority"] = priority
if assignees:
updates["assignee"] = assignees
if due_date != -1:
updates["dueDate"] = due_date or None
if project_id:
updates["projectId"] = project_id
result = await execute_on_client(
action="update",
table="tasks",
data={"id": task_id, "updates": updates},
)
row = result["row"]
return f"Task updated: '{row['title']}' (id: {row['id']}, status: {row['status']})"
@tool
async def delete_task(task_id: str) -> str:
"""Delete a task permanently by its UUID."""
await execute_on_client(action="delete", table="tasks", data={"id": task_id})
return f"Task {task_id} deleted."
@tool
async def list_tasks_due_today() -> str:
"""List all tasks whose due date falls on today's date."""
now = datetime.now(tz=timezone.utc)
start_ms = int(datetime(now.year, now.month, now.day, tzinfo=timezone.utc).timestamp() * 1000)
end_ms = start_ms + 86_400_000 - 1
result = await execute_on_client(
action="select",
table="tasks",
filters={"dueDateFrom": start_ms, "dueDateTo": end_ms},
)
rows = result.get("rows", [])
if not rows:
return "No tasks are due today."
lines = [
f"- {r['title']} (priority: {r['priority']}, status: {r['status']}, id: {r['id']})"
for r in rows
]
return f"Tasks due today ({len(rows)}):\n" + "\n".join(lines)
@tool
async def list_task_comments(task_id: str) -> str:
"""List all comments on a task by its UUID."""
result = await execute_on_client(
action="select",
table="taskComments",
filters={"taskId": task_id},
)
rows = result.get("rows", [])
if not rows:
return f"No comments found for task {task_id}."
lines = [f"- [{r['author']}]: {r['content']} (id: {r['id']})" for r in rows]
return f"Found {len(rows)} comment(s):\n" + "\n".join(lines)
@tool
async def add_task_comment(task_id: str, author: str, content: str) -> str:
"""Add a comment to a task."""
result = await execute_on_client(
action="insert",
table="taskComments",
data={"taskId": task_id, "author": author, "content": content},
)
row = result.get("row", {})
row_author = row.get("author", author)
row_task_id = row.get("taskId") or row.get("task_id") or task_id
row_comment_id = row.get("id", "unknown")
return f"Comment added by {row_author} on task {row_task_id} (comment id: {row_comment_id})."
@tool
async def delete_task_comment(comment_id: str) -> str:
"""Delete a task comment by its UUID."""
await execute_on_client(action="delete", table="taskComments", data={"id": comment_id})
return f"Comment {comment_id} deleted."
TASK_TOOLS: list[Any] = [
list_tasks,
create_task,
update_task,
delete_task,
list_tasks_due_today,
list_task_comments,
add_task_comment,
delete_task_comment,
]


@@ -1,88 +0,0 @@
"""Timeline agent — project milestone management.
Adapted for Batch Agent Service: import from app.ws_context.
"""
from __future__ import annotations
import re
from typing import Any
from langchain_core.tools import tool
from app.ws_context import execute_on_client
_UUID_RE = re.compile(
r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
)
def _is_uuid(value: str) -> bool:
return bool(_UUID_RE.match(value))
@tool
async def list_timelines(project_id: str = "") -> str:
"""List timelines. Provide project_id to scope to a specific project."""
normalized_project_id = project_id if (project_id and _is_uuid(project_id)) else ""
result = await execute_on_client(
action="select",
table="timelines",
filters={"projectId": normalized_project_id or None},
)
rows = result.get("rows", [])
if not rows:
return "No timelines found."
lines = [f"- {r['title']} (date: {r['date']}, id: {r['id']})" for r in rows]
return f"Found {len(rows)} timeline(s):\n" + "\n".join(lines)
@tool
async def create_timeline(
project_id: str, title: str, date: int, is_ai_suggested: int = 0,
) -> str:
"""Create a project timeline (milestone)."""
result = await execute_on_client(
action="insert",
table="timelines",
data={
"projectId": project_id,
"title": title,
"date": date,
"isAiSuggested": is_ai_suggested,
},
)
row = result["row"]
return f"Timeline created: '{row['title']}' (id: {row['id']}, date: {row['date']})"
@tool
async def update_timeline(timeline_id: str, title: str = "", date: int = -1) -> str:
"""Update a timeline. Only pass fields that should change."""
updates: dict[str, Any] = {}
if title:
updates["title"] = title
if date != -1:
updates["date"] = date
result = await execute_on_client(
action="update",
table="timelines",
data={"id": timeline_id, "updates": updates},
)
row = result["row"]
return f"Timeline updated: '{row['title']}' (id: {row['id']})"
@tool
async def delete_timeline(timeline_id: str) -> str:
"""Delete a timeline permanently by its UUID."""
await execute_on_client(action="delete", table="timelines", data={"id": timeline_id})
return f"Timeline {timeline_id} deleted."
TIMELINE_TOOLS: list[Any] = [
list_timelines,
create_timeline,
update_timeline,
delete_timeline,
]


@@ -25,7 +25,8 @@ from typing import Any
 from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage
 from app.agents.filesystem_agent import FILESYSTEM_TOOLS
-from app.llm import get_llm
+from shared.llm import get_llm
+import app.tracing as tracing
 logger = logging.getLogger(__name__)
@@ -79,17 +80,9 @@ def get_journey_session(session_id: str, user_id: str) -> JourneySession | None:
_SYSTEM_PROMPT_TEMPLATE = """\ _SYSTEM_PROMPT_TEMPLATE = """\
You are a friendly assistant helping a freelancer configure a data-extraction agent. You are a friendly assistant helping a freelancer configure a data-extraction agent.
Your job is to understand exactly what data the user wants to extract from their Your job is to understand exactly what data the user wants to extract from their
local directory and produce a detailed prompt_template that a separate AI will use local directory and produce a concise prompt_template that a separate AI will use
as its instruction set. as its instruction set.
The extraction agent already has this base behaviour built in:
- Reads each file using file-system tools.
- Creates records (tasks, notes, timelines, projects) via CRUD tools.
- Sets isAiSuggested=1 on every new record.
- Only extracts data explicitly present in the files — it never invents information.
The user's custom prompt is appended AFTER this base behaviour, so focus on
what to look for and how to map it — not on the general extraction mechanics.
You have access to file-system tools to explore the user's directory: You have access to file-system tools to explore the user's directory:
- list_directory: to see folder structure - list_directory: to see folder structure
- read_file_content: to peek at file contents - read_file_content: to peek at file contents
@@ -98,38 +91,43 @@ You have access to file-system tools to explore the user's directory:
The user's configured directory is: {directory} The user's configured directory is: {directory}
Target data types: {data_types} Target data types: {data_types}
IMPORTANT — project assignment is handled automatically by the main agent runner IMPORTANT — project assignment is handled automatically. You MUST NOT ask the user
before the custom prompt is ever used. You MUST NOT ask the user about projects, about projects, projectId, or how to link records to projects. Never include
projectId, or how to link records to projects. Never include projectId logic or projectId logic or project creation instructions in the generated prompt_template.
project creation instructions in the generated prompt_template.
Start by exploring the directory to understand its structure. Then ask concise, Start by exploring the directory to understand its structure. Then ask concise,
focused questions one at a time. Cover these topics (not necessarily in this order): focused questions one at a time. Cover only the topics relevant to the target
1. The type and format of the source content (confirmed by your exploration). data types listed above:
2. How fields should be mapped (e.g. filename → task title).
3. Priority or status rules (e.g. "urgent" keyword → high priority).
4. Any special handling, date extraction, or exclusions.
Once you reach 90% confidence, output the final prompt_template between these exact 1. Content type and format — confirmed by your exploration.
markers on their own lines: 2. For TASKS (if in scope): field mapping for title, status, priority, content,
dueDate (where is the date found? what's the fallback when absent?),
and assignee (is there a person name to assign?).
3. For NOTES when TASKS are also in scope: note vs task distinction —
what makes something a note rather than a task?
4. For TIMELINES (if in scope): the date source — what marks a milestone or event?
5. Exclusions and special handling applicable to the target data types.
Keep asking focused questions until you are at least 90% confident. Then stop and
output the final prompt_template immediately, wrapped between these exact markers
on their own lines:
{template_start} {template_start}
<the complete extraction prompt here> <the complete extraction prompt here>
{template_end} {template_end}
The prompt_template must be a self-contained instruction for an AI that reads files The prompt_template must be concise (bullet points, ~15–25 lines maximum).
and must perform CRUD operations using tools to create records. It should specify: Specify only:
- What entity types to create (tasks, notes, timelines) — never projects. - Scope: what files/content qualify and what entity types to create.
- How to map file content to record fields (camelCase: title, status, priority, - Field mapping rules per entity type (camelCase fields: title, status, priority,
dueDate, content, etc.) — never include projectId. dueDate, content, assignee, etc.).
- That isAiSuggested must be set to 1 on every new record. - dueDate rule (if tasks in scope): source and fallback behaviour.
- Concrete examples of mappings based on what you discovered in the directory. - Note vs task rule (if both in scope): the criterion that separates them.
- Timeline date rule (if timelines in scope): what constitutes a timeline event.
- Exclusion/filtering rules.
- 2-3 concrete mapping examples based on what you discovered.
{existing_section}\ {existing_section}Begin by exploring the directory, then ask your first question.\
Keep asking clarifying questions until you are at least 90% confident you have
enough information to generate an accurate prompt_template. Once you reach that
confidence level, stop asking and produce the final template immediately.
Begin by exploring the directory, then ask your first question.\
""" """
@@ -144,12 +142,15 @@ def _build_system_prompt(
if existing_template if existing_template
else "" else ""
) )
return _SYSTEM_PROMPT_TEMPLATE.format( # Use Langfuse compile_prompt ({{variable}} syntax) with Python .format() fallback
directory=directory, return tracing.compile_prompt(
data_types=", ".join(data_types), "journey_system",
template_start=_TEMPLATE_START, fallback=_SYSTEM_PROMPT_TEMPLATE,
template_end=_TEMPLATE_END, variables={
existing_section=existing_section, "directory": directory,
"data_types": ", ".join(data_types),
"existing_section": existing_section,
},
) )
@@ -190,6 +191,7 @@ async def _call_llm_with_tools(
system_prompt: str, system_prompt: str,
history: list[dict[str, Any]], history: list[dict[str, Any]],
tools: list[Any], tools: list[Any],
langfuse_handler: Any | None = None,
) -> str: ) -> str:
"""Build LangChain messages from history and invoke the LLM with tools. """Build LangChain messages from history and invoke the LLM with tools.
@@ -203,7 +205,8 @@ async def _call_llm_with_tools(
else: else:
messages.append(AIMessage(content=turn["content"])) messages.append(AIMessage(content=turn["content"]))
llm = get_llm(model=None, temperature=0.4) callbacks = [langfuse_handler] if langfuse_handler else None
llm = get_llm(model=None, temperature=0.4, callbacks=callbacks)
llm_with_tools = llm.bind_tools(tools) llm_with_tools = llm.bind_tools(tools)
tool_map = {tool_def.name: tool_def for tool_def in tools} tool_map = {tool_def.name: tool_def for tool_def in tools}
@@ -247,6 +250,8 @@ async def _call_llm_with_tools(
async def handle_journey_start( async def handle_journey_start(
user_id: str, user_id: str,
frame: dict[str, Any], frame: dict[str, Any],
*,
langfuse_handler: Any | None = None,
) -> dict[str, Any]: ) -> dict[str, Any]:
"""Handle a ``journey_start`` request. """Handle a ``journey_start`` request.
@@ -277,6 +282,7 @@ async def handle_journey_start(
system_prompt=system_prompt, system_prompt=system_prompt,
history=seed_history, history=seed_history,
tools=list(FILESYSTEM_TOOLS), tools=list(FILESYSTEM_TOOLS),
langfuse_handler=langfuse_handler,
) )
session.history.extend(seed_history) session.history.extend(seed_history)
@@ -313,6 +319,8 @@ async def handle_journey_start(
async def handle_journey_message( async def handle_journey_message(
user_id: str, user_id: str,
frame: dict[str, Any], frame: dict[str, Any],
*,
langfuse_handler: Any | None = None,
) -> dict[str, Any]: ) -> dict[str, Any]:
"""Handle a ``journey_message`` request. """Handle a ``journey_message`` request.
@@ -338,6 +346,7 @@ async def handle_journey_message(
system_prompt=session.system_prompt, system_prompt=session.system_prompt,
history=session.history, history=session.history,
tools=list(FILESYSTEM_TOOLS), tools=list(FILESYSTEM_TOOLS),
langfuse_handler=langfuse_handler,
) )
session.history.append({"role": "assistant", "content": ai_reply}) session.history.append({"role": "assistant", "content": ai_reply})
@@ -358,6 +367,7 @@ async def handle_journey_message(
system_prompt=session.system_prompt, system_prompt=session.system_prompt,
history=session.history, history=session.history,
tools=list(FILESYSTEM_TOOLS), tools=list(FILESYSTEM_TOOLS),
langfuse_handler=langfuse_handler,
) )
session.history.append({"role": "assistant", "content": nudge_reply}) session.history.append({"role": "assistant", "content": nudge_reply})
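
Note on the fallback path for `journey_system`: the new call passes only `directory`, `data_types` and `existing_section`, while the template text in this hunk still shows `{template_start}` and `{template_end}`. If those remain live `str.format` fields in the fallback string, `fallback.format(**variables)` would raise `KeyError` whenever Langfuse is disabled. The sketch below is one way to check this up front. `missing_fields` is a hypothetical helper that is not part of this commit, and `FALLBACK` is a shortened stand-in for the real template.

```python
from __future__ import annotations

from string import Formatter


def missing_fields(template: str, variables: dict[str, str]) -> set[str]:
    """Return str.format field names used by `template` but absent from `variables`."""
    used = {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skip literal-only chunks (None) and bare "{}" fields ("")
    }
    return used - set(variables)


# Shortened stand-in for the real fallback template (assumption).
FALLBACK = (
    "Explore {directory} for {data_types}.\n"
    "{template_start}\n<prompt here>\n{template_end}\n"
    "{existing_section}"
)

variables = {"directory": "/data", "data_types": "tasks, notes", "existing_section": ""}
print(sorted(missing_fields(FALLBACK, variables)))  # ['template_end', 'template_start']
```

A check like this could run once at import time or in a unit test, so a mismatch between the fallback text and the `variables` dict shows up before the first request.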


@@ -32,6 +32,8 @@ def _api_key_for_model(model: str) -> str | None:
return settings.GOOGLE_API_KEY or None return settings.GOOGLE_API_KEY or None
if model.startswith("cerebras/"): if model.startswith("cerebras/"):
return settings.CEREBRAS_API_KEY or None return settings.CEREBRAS_API_KEY or None
if model.startswith("github/"):
return settings.GITHUB_TOKEN or None
if model.startswith("github_copilot/"): if model.startswith("github_copilot/"):
return None return None
return settings.OPENAI_API_KEY or None return settings.OPENAI_API_KEY or None
@@ -41,29 +43,27 @@ def get_llm(
*, *,
model: str | None = None, model: str | None = None,
temperature: float = 0, temperature: float = 0,
callbacks: list | None = None,
) -> ChatOpenAI | ChatLiteLLM: ) -> ChatOpenAI | ChatLiteLLM:
model = model or settings.LLM_MODEL model = model or settings.LLM_MODEL
if settings.GITHUB_COPILOT_TOKEN_DIR: if settings.GITHUB_COPILOT_TOKEN_DIR:
os.environ.setdefault("GITHUB_COPILOT_TOKEN_DIR", settings.GITHUB_COPILOT_TOKEN_DIR) os.environ.setdefault("GITHUB_COPILOT_TOKEN_DIR", settings.GITHUB_COPILOT_TOKEN_DIR)
if settings.GITHUB_TOKEN:
os.environ.setdefault("GITHUB_TOKEN", settings.GITHUB_TOKEN)
if "/" in model: if "/" in model:
return ChatLiteLLM(model=model, temperature=temperature) return ChatLiteLLM(model=model, temperature=temperature, callbacks=callbacks)
return ChatOpenAI( return ChatOpenAI(
model=model, model=model,
temperature=temperature, temperature=temperature,
api_key=_api_key_for_model(model), api_key=_api_key_for_model(model),
callbacks=callbacks,
) )
def get_router_llm(
*,
temperature: float = 0,
) -> ChatOpenAI | ChatLiteLLM:
return get_llm(model=settings.LLM_ROUTER_MODEL, temperature=temperature)
async def embed(text: str) -> list[float]: async def embed(text: str) -> list[float]:
model = settings.LLM_EMBED_MODEL model = settings.LLM_EMBED_MODEL
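
The prefix checks in `_api_key_for_model` can also be read as a lookup table. The sketch below covers only the prefixes visible in this hunk and uses a hypothetical `FakeSettings` class in place of `shared.config.settings`; it illustrates the routing rule and is not a proposed replacement.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakeSettings:
    """Hypothetical stand-in for shared.config.settings."""
    CEREBRAS_API_KEY: str = ""
    GITHUB_TOKEN: str = ""
    OPENAI_API_KEY: str = ""


# (prefix, settings attribute); None means the provider needs no API key here.
_PREFIX_TO_SETTING: tuple[tuple[str, str | None], ...] = (
    ("cerebras/", "CEREBRAS_API_KEY"),
    ("github/", "GITHUB_TOKEN"),
    ("github_copilot/", None),
)


def api_key_for_model(model: str, settings: FakeSettings) -> str | None:
    for prefix, attr in _PREFIX_TO_SETTING:
        if model.startswith(prefix):
            # Empty strings are normalised to None, as in the original function.
            return (getattr(settings, attr) or None) if attr else None
    return settings.OPENAI_API_KEY or None


cfg = FakeSettings(GITHUB_TOKEN="ghp_example", OPENAI_API_KEY="sk-example")
print(api_key_for_model("github/gpt-4o", cfg))          # ghp_example
print(api_key_for_model("github_copilot/gpt-4o", cfg))  # None
print(api_key_for_model("gpt-4o", cfg))                 # sk-example
```

Because `"github/"` and `"github_copilot/"` differ at the seventh character, the order of the two entries does not affect the result.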


@@ -14,6 +14,14 @@ from __future__ import annotations
import asyncio import asyncio
import logging import logging
import sys
from pathlib import Path
# Ensure the repo root is on sys.path so ``shared`` is importable when
# running locally (in Docker the COPY already places it at /app/shared/).
_repo_root = str(Path(__file__).resolve().parents[3])
if _repo_root not in sys.path:
sys.path.insert(0, _repo_root)
from contextlib import asynccontextmanager from contextlib import asynccontextmanager
from typing import AsyncGenerator from typing import AsyncGenerator
@@ -29,6 +37,10 @@ logger = logging.getLogger(__name__)
@asynccontextmanager @asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]: async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
# Initialise Langfuse tracing (no-op if keys are missing)
from app.tracing import init_langfuse
init_langfuse()
logger.info("batch-agent: starting Redis consumer") logger.info("batch-agent: starting Redis consumer")
task = asyncio.create_task(start_consumer()) task = asyncio.create_task(start_consumer())
yield yield
@@ -37,6 +49,16 @@ async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
await task await task
except asyncio.CancelledError: except asyncio.CancelledError:
pass pass
from app.tracing import shutdown as shutdown_langfuse
shutdown_langfuse()
from shared.db import engine
await engine.dispose()
from shared.redis import redis_client
await redis_client.aclose()
logger.info("batch-agent: Redis consumer stopped") logger.info("batch-agent: Redis consumer stopped")
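
The shutdown sequence added here (cancel the consumer, then close tracing, the DB engine and Redis) can be exercised without FastAPI. The sketch below is a minimal stand-in under stated assumptions: the resource names are just strings appended to a list, and the `try`/`finally` around `yield` is an illustrative choice that is not necessarily what the commit does.

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []


async def consumer() -> None:
    """Stand-in for start_consumer(): runs until cancelled."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        events.append("consumer cancelled")
        raise


@asynccontextmanager
async def lifespan():
    events.append("tracing init")
    task = asyncio.create_task(consumer())
    try:
        yield
    finally:
        # Stop the background task first, then release shared resources.
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        events.append("tracing shutdown")
        events.append("db disposed")
        events.append("redis closed")


async def main() -> None:
    async with lifespan():
        await asyncio.sleep(0)  # let the consumer task start
        events.append("serving")


asyncio.run(main())
print(events)
```

Stopping the consumer before disposing of the engine and the Redis client means no in-flight handler is left holding a closed connection.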


@@ -17,7 +17,8 @@ from typing import Any
from shared.redis import redis_client, batch_request_channel, ws_out_channel from shared.redis import redis_client, batch_request_channel, ws_out_channel
from app.ws_context import set_current_user, clear_current_user import app.tracing as tracing
from shared.ws_context import set_current_user, clear_current_user
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -32,15 +33,28 @@ async def _handle_journey_start(user_id: str, data: dict[str, Any]) -> None:
"""Handle a journey_start request from WS Gateway.""" """Handle a journey_start request from WS Gateway."""
from app.journey import handle_journey_start from app.journey import handle_journey_start
session_id = data.get("session_id", "")
set_current_user(user_id) set_current_user(user_id)
try: try:
reply = await handle_journey_start(user_id, data) with tracing.trace_span(
name="journey_start",
user_id=user_id,
session_id=session_id,
input=data.get("directory", ""),
metadata={"data_types": data.get("data_types", [])},
tags=["journey"],
) as span:
langfuse_handler = tracing.get_langfuse_callback()
reply = await handle_journey_start(user_id, data, langfuse_handler=langfuse_handler)
tracing.link_prompt_to_trace(span, "journey_system")
span.update(output=reply.get("message", "")[:500])
await _publish_to_user(user_id, reply) await _publish_to_user(user_id, reply)
tracing.flush()
except Exception as exc: except Exception as exc:
logger.error("batch-agent: journey_start failed user=%s: %s", user_id, exc) logger.error("batch-agent: journey_start failed user=%s: %s", user_id, exc)
await _publish_to_user(user_id, { await _publish_to_user(user_id, {
"type": "journey_reply", "type": "journey_reply",
"session_id": data.get("session_id", ""), "session_id": session_id,
"message": f"Journey setup failed: {exc}", "message": f"Journey setup failed: {exc}",
"done": True, "done": True,
"prompt_template": None, "prompt_template": None,
@@ -53,15 +67,27 @@ async def _handle_journey_message(user_id: str, data: dict[str, Any]) -> None:
"""Handle a journey_message from WS Gateway.""" """Handle a journey_message from WS Gateway."""
from app.journey import handle_journey_message from app.journey import handle_journey_message
session_id = data.get("session_id", "")
set_current_user(user_id) set_current_user(user_id)
try: try:
reply = await handle_journey_message(user_id, data) with tracing.trace_span(
name="journey_message",
user_id=user_id,
session_id=session_id,
input=data.get("message", "")[:200],
tags=["journey"],
) as span:
langfuse_handler = tracing.get_langfuse_callback()
reply = await handle_journey_message(user_id, data, langfuse_handler=langfuse_handler)
tracing.link_prompt_to_trace(span, "journey_system")
span.update(output=reply.get("message", "")[:500])
await _publish_to_user(user_id, reply) await _publish_to_user(user_id, reply)
tracing.flush()
except Exception as exc: except Exception as exc:
logger.error("batch-agent: journey_message failed user=%s: %s", user_id, exc) logger.error("batch-agent: journey_message failed user=%s: %s", user_id, exc)
await _publish_to_user(user_id, { await _publish_to_user(user_id, {
"type": "journey_reply", "type": "journey_reply",
"session_id": data.get("session_id", ""), "session_id": session_id,
"message": f"Journey processing failed: {exc}", "message": f"Journey processing failed: {exc}",
"done": True, "done": True,
"prompt_template": None, "prompt_template": None,
@@ -74,15 +100,29 @@ async def _handle_agent_trigger(user_id: str, data: dict[str, Any]) -> None:
"""Handle an agent_trigger request from the REST route (forwarded via Redis).""" """Handle an agent_trigger request from the REST route (forwarded via Redis)."""
from app.agent_runner import run_local_agent from app.agent_runner import run_local_agent
run_context = data.get("run_context", {})
agent_id = run_context.get("agent_id", "")
set_current_user(user_id) set_current_user(user_id)
try: try:
await run_local_agent(user_id, data) with tracing.trace_span(
name="agent_trigger",
user_id=user_id,
trace_id=run_context.get("run_id"),
input={"agent_id": agent_id, "directory": data.get("directory", "")},
metadata={"data_types": data.get("data_types", [])},
tags=["batch", "agent_run"],
) as span:
langfuse_handler = tracing.get_langfuse_callback()
await run_local_agent(user_id, data, langfuse_handler=langfuse_handler)
tracing.link_prompt_to_trace(span, "batch_processing")
span.update(output={"status": "completed"})
tracing.flush()
except Exception as exc: except Exception as exc:
logger.error("batch-agent: agent_trigger failed user=%s: %s", user_id, exc) logger.error("batch-agent: agent_trigger failed user=%s: %s", user_id, exc)
await _publish_to_user(user_id, { await _publish_to_user(user_id, {
"type": "run_complete", "type": "run_complete",
"status": "error", "status": "error",
"run_context": data.get("run_context", {}), "run_context": run_context,
}) })
finally: finally:
clear_current_user() clear_current_user()
@@ -98,6 +138,8 @@ async def _dispatch(user_id: str, message_data: dict[str, Any]) -> None:
await _handle_journey_message(user_id, message_data) await _handle_journey_message(user_id, message_data)
elif msg_type == "agent_trigger": elif msg_type == "agent_trigger":
await _handle_agent_trigger(user_id, message_data) await _handle_agent_trigger(user_id, message_data)
elif msg_type == "device_online":
logger.info("batch-agent: device_online user=%s device=%s", user_id, message_data.get("device_id", "?"))
else: else:
logger.warning("batch-agent: unknown message type %r from user=%s", msg_type, user_id) logger.warning("batch-agent: unknown message type %r from user=%s", msg_type, user_id)
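
The three handlers above share one shape: open a span, run the handler, record a truncated output, flush. The sketch below isolates that shape with stand-ins. `RecordingSpan`, `run_traced` and the `flush_calls` list are hypothetical names, and putting the flush in a `finally` block is an assumption made for illustration (the commit flushes only on the success path).

```python
import asyncio
from contextlib import contextmanager
from typing import Any, Awaitable, Callable


class RecordingSpan:
    """Stand-in span that remembers what was passed to update()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.fields: dict[str, Any] = {}

    def update(self, **fields: Any) -> None:
        self.fields.update(fields)


spans: list[RecordingSpan] = []
flush_calls: list[str] = []


@contextmanager
def trace_span(name: str):
    span = RecordingSpan(name)
    spans.append(span)
    yield span


async def run_traced(
    name: str,
    handler: Callable[[dict[str, Any]], Awaitable[dict[str, Any]]],
    data: dict[str, Any],
    *,
    max_output: int = 500,
) -> dict[str, Any]:
    try:
        with trace_span(name) as span:
            reply = await handler(data)
            # Only a bounded prefix of the reply is stored on the span.
            span.update(output=reply.get("message", "")[:max_output])
        return reply
    finally:
        flush_calls.append(name)  # stands in for tracing.flush()


async def echo(data: dict[str, Any]) -> dict[str, Any]:
    return {"message": data["message"] * 3}


reply = asyncio.run(run_traced("journey_message", echo, {"message": "x" * 400}))
print(len(reply["message"]), len(spans[0].fields["output"]))  # 1200 500
```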


@@ -0,0 +1,336 @@
"""Langfuse tracing & prompt management for the Batch Agent Service (v4 SDK).
Provides:
- ``init_langfuse()`` — initialise the singleton client at startup
- ``trace_span()`` — context manager that creates a trace + span
- ``get_langfuse_callback()`` — LangChain callback handler (auto-inherits trace)
- ``get_prompt()`` — fetch a managed prompt from Langfuse by name
- ``flush()`` / ``shutdown()`` — lifecycle management
All functions gracefully degrade to no-ops when Langfuse is not configured,
so the service works identically with or without observability keys.
Requires ``langfuse >= 3.0.0`` (v4 / "Fast Preview" SDK).
"""
from __future__ import annotations
import logging
from contextlib import contextmanager
from typing import Any
from shared.config import settings
logger = logging.getLogger(__name__)
# ── State ────────────────────────────────────────────────────────────────
_initialised: bool = False
_disabled: bool = False
def _is_configured() -> bool:
return bool(settings.LANGFUSE_SECRET_KEY and settings.LANGFUSE_PUBLIC_KEY)
def init_langfuse() -> None:
"""Initialise the Langfuse singleton. Call once at startup."""
global _initialised, _disabled
if _initialised or _disabled:
return
if not _is_configured():
_disabled = True
logger.info("tracing: Langfuse keys not set — tracing disabled")
return
try:
from langfuse import Langfuse
Langfuse(
secret_key=settings.LANGFUSE_SECRET_KEY,
public_key=settings.LANGFUSE_PUBLIC_KEY,
host=settings.LANGFUSE_HOST,
)
_initialised = True
logger.info("tracing: Langfuse client initialised (host=%s)", settings.LANGFUSE_HOST)
except Exception as exc:
_disabled = True
logger.warning("tracing: failed to initialise Langfuse: %s", exc)
def _get_client() -> Any | None:
"""Return the singleton Langfuse client, or *None* if disabled."""
if _disabled:
return None
if not _initialised:
init_langfuse()
if _disabled:
return None
try:
from langfuse import get_client
return get_client()
except Exception:
return None
# ── Null span (no-op when Langfuse is disabled) ─────────────────────────
class _NullSpan:
"""Drop-in replacement when Langfuse is disabled."""
def update(self, **_: Any) -> None: ...
def set_trace_io(self, **_: Any) -> None: ...
def score_trace(self, **_: Any) -> None: ...
# ── Trace context manager ───────────────────────────────────────────────
@contextmanager
def trace_span(
*,
name: str,
user_id: str,
session_id: str | None = None,
trace_id: str | None = None,
input: Any = None,
metadata: dict[str, Any] | None = None,
tags: list[str] | None = None,
):
"""Context manager that creates a Langfuse trace/span.
Yields the span object (or a ``_NullSpan`` if Langfuse is disabled).
A ``CallbackHandler`` created inside this block auto-inherits the trace
context, so there is no need to pass trace IDs manually.
"""
lf = _get_client()
if lf is None:
yield _NullSpan()
return
try:
from langfuse import Langfuse, propagate_attributes
trace_ctx: dict[str, str] = {}
if trace_id is not None:
trace_ctx["trace_id"] = Langfuse.create_trace_id(seed=trace_id)
with lf.start_as_current_observation(
as_type="span",
name=name,
input=input,
metadata=metadata or {},
**({"trace_context": trace_ctx} if trace_ctx else {}),
) as span:
with propagate_attributes(
user_id=user_id,
session_id=session_id,
tags=tags or [],
):
yield span
except Exception as exc:
logger.warning("tracing: trace_span(%s) failed: %s", name, exc)
yield _NullSpan()
# ── LangChain callback handler ──────────────────────────────────────────
def get_langfuse_callback() -> Any | None:
"""Return a LangChain ``CallbackHandler`` that auto-inherits the current trace.
Must be called inside a ``trace_span()`` block for proper linking.
Returns *None* when Langfuse is disabled.
"""
if _disabled and not _initialised:
return None
try:
from langfuse.langchain import CallbackHandler
return CallbackHandler()
except Exception as exc:
logger.warning("tracing: get_langfuse_callback failed: %s", exc)
return None
# ── Prompt management ────────────────────────────────────────────────────
def get_prompt(
name: str,
*,
version: int | None = None,
label: str | None = None,
fallback: str | None = None,
cache_ttl_seconds: int = 300,
) -> str | None:
"""Fetch a managed prompt from Langfuse by name (without variable compilation).
Returns the raw prompt string, or *fallback* if the prompt is not
found or Langfuse is disabled.
"""
lf = _get_client()
if lf is None:
return fallback
try:
kwargs: dict[str, Any] = {
"name": name,
"cache_ttl_seconds": cache_ttl_seconds,
}
if version is not None:
kwargs["version"] = version
if label is not None:
kwargs["label"] = label
prompt = lf.get_prompt(**kwargs)
return prompt.prompt
except Exception as exc:
logger.warning("tracing: get_prompt(%s) failed: %s", name, exc)
return fallback
def compile_prompt(
name: str,
*,
fallback: str,
variables: dict[str, str],
version: int | None = None,
label: str | None = None,
cache_ttl_seconds: int = 300,
) -> str:
"""Fetch a managed prompt from Langfuse and compile it with ``{{variables}}``.
If the prompt exists in Langfuse, uses the SDK's ``.compile(**variables)``
which replaces ``{{key}}`` placeholders. If Langfuse is disabled or the
prompt is not found, falls back to ``fallback.format(**variables)`` (Python
``{key}`` placeholders).
This means:
- Langfuse prompts use ``{{variable}}`` syntax.
- Hardcoded fallback strings use Python ``{variable}`` syntax.
"""
lf = _get_client()
if lf is None:
return fallback.format(**variables)
try:
kwargs: dict[str, Any] = {
"name": name,
"cache_ttl_seconds": cache_ttl_seconds,
}
if version is not None:
kwargs["version"] = version
if label is not None:
kwargs["label"] = label
prompt = lf.get_prompt(**kwargs)
return prompt.compile(**variables)
except Exception as exc:
logger.warning("tracing: compile_prompt(%s) failed, using fallback: %s", name, exc)
return fallback.format(**variables)
def get_prompt_object(
name: str,
*,
version: int | None = None,
label: str | None = None,
cache_ttl_seconds: int = 300,
) -> Any | None:
"""Fetch the raw Langfuse prompt *object* (not the compiled string).
Returns ``None`` when Langfuse is disabled or the prompt is not found.
Use this when you need to pass the prompt to ``start_observation(prompt=...)``
for linking the prompt to a trace in the Langfuse UI.
"""
lf = _get_client()
if lf is None:
return None
try:
kwargs: dict[str, Any] = {
"name": name,
"cache_ttl_seconds": cache_ttl_seconds,
}
if version is not None:
kwargs["version"] = version
if label is not None:
kwargs["label"] = label
return lf.get_prompt(**kwargs)
except Exception as exc:
logger.warning("tracing: get_prompt_object(%s) failed: %s", name, exc)
return None
def link_prompt_to_trace(
span: Any,
prompt_name: str,
*,
version: int | None = None,
label: str | None = None,
) -> None:
"""Link a Langfuse managed prompt to a span/observation.
Uses the SDK v4 ``prompt=`` parameter so that the prompt version
appears linked in the Langfuse UI with metrics tracking.
"""
lf = _get_client()
if lf is None or isinstance(span, _NullSpan):
return
try:
prompt = get_prompt_object(prompt_name, version=version, label=label)
if prompt is not None:
span.update(prompt=prompt)
except Exception as exc:
logger.warning("tracing: link_prompt_to_trace(%s) failed: %s", prompt_name, exc)
# ── Scoring helper ───────────────────────────────────────────────────────
def score_trace(
trace_id: str,
name: str,
value: float,
*,
comment: str | None = None,
) -> None:
"""Post a score to a trace (e.g. user feedback, latency, quality)."""
lf = _get_client()
if lf is None:
return
try:
lf.create_score(trace_id=trace_id, name=name, value=value, comment=comment)
except Exception as exc:
logger.warning("tracing: score_trace failed: %s", exc)
# ── Shutdown ─────────────────────────────────────────────────────────────
def flush() -> None:
"""Flush pending Langfuse events."""
lf = _get_client()
if lf is not None:
try:
lf.flush()
except Exception as exc:
logger.warning("tracing: flush failed: %s", exc)
def shutdown() -> None:
"""Flush and close the Langfuse client."""
global _initialised, _disabled
lf = _get_client()
if lf is not None:
try:
lf.flush()
lf.shutdown()
except Exception as exc:
logger.warning("tracing: shutdown failed: %s", exc)
_initialised = False
_disabled = False
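
One caveat about `trace_span()`, as far as the flattened listing shows: the `yield span` sits inside the `try`, so an exception raised in the caller's `with` body is thrown back into the generator, caught by `except Exception`, and followed by a second `yield`. `contextlib` reports that as `RuntimeError: generator didn't stop after throw()`, which hides the original error. The sketch below reproduces the effect with a reduced `fragile_span` and shows one possible alternative, `safe_span`, which guards only the setup step. Both names are hypothetical, as is the `open_span` callable.

```python
from contextlib import ExitStack, contextmanager, nullcontext


class NullSpan:
    """No-op span used when tracing setup fails."""

    def update(self, **_):
        ...


@contextmanager
def fragile_span():
    # Same shape as the listing: the yield is inside the try block.
    try:
        yield "real-span"
    except Exception:
        yield NullSpan()  # second yield after a throw() is not allowed


@contextmanager
def safe_span(open_span):
    """`open_span` is a zero-argument callable returning a context manager."""
    with ExitStack() as stack:
        try:
            span = stack.enter_context(open_span())
        except Exception:
            span = NullSpan()  # setup failed: degrade to a no-op span
        yield span  # body errors propagate unchanged through the stack


def body_error_type(cm) -> str:
    try:
        with cm:
            raise ValueError("handler failed")
    except Exception as exc:
        return type(exc).__name__
    return "none"


print(body_error_type(fragile_span()))                              # RuntimeError
print(body_error_type(safe_span(lambda: nullcontext("real-span")))) # ValueError
```

With `safe_span`, the handlers' own `except Exception` blocks would see the real failure instead of the `contextlib` error.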


@@ -0,0 +1 @@
"""Batch Agent E2E evaluation harness."""


@@ -0,0 +1,5 @@
"""Allow running the eval package as ``python -m eval``."""
from eval.cli import main
main()


@@ -0,0 +1,285 @@
"""CLI entry point for the batch agent evaluation harness.
Usage::
# From services/batch-agent/:
python -m eval run # all agent fixtures, default model
python -m eval run --fixture=classify-invoices # single fixture
python -m eval run --models=gpt-4o,gpt-5.3-codex # multiple models
python -m eval run --mode=step1 # only step1 fixtures
python -m eval run --no-judge # skip LLM judge scoring
python -m eval interactive # interactive journey session
python -m eval interactive --fixture=journey-invoice-setup
python -m eval interactive --model=gpt-4o
python -m eval interactive --judge-model=github_copilot/gpt-4o-mini
python -m eval list # list all fixtures
python -m eval sync # sync fixtures to Langfuse datasets
"""
from __future__ import annotations
import argparse
import asyncio
import logging
import sys
from pathlib import Path
# Ensure the service root and repo root are in sys.path.
# Service root must come BEFORE repo root so its ``app/`` package
# shadows the monolith ``app/`` in the repo root.
_SERVICE_ROOT = Path(__file__).resolve().parent.parent
_REPO_ROOT = _SERVICE_ROOT.parent.parent
_sr = str(_SERVICE_ROOT)
_rr = str(_REPO_ROOT)
if _rr not in sys.path:
sys.path.insert(0, _rr)
# Always force service root to position 0 (python -m may have already
# added CWD further down the list, which loses to repo root).
if _sr in sys.path:
sys.path.remove(_sr)
sys.path.insert(0, _sr)
from eval.config import discover_fixtures, discover_journey_fixtures
from eval.runner import run_fixture_eval, print_results
from eval.interactive import run_interactive
from eval import langfuse_eval
def _setup_logging(verbose: bool) -> None:
level = logging.DEBUG if verbose else logging.INFO
logging.basicConfig(
level=level,
format="%(asctime)s %(name)-20s %(levelname)-5s %(message)s",
datefmt="%H:%M:%S",
)
# Quiet noisy libraries
for name in ("httpx", "httpcore", "openai", "litellm", "urllib3"):
logging.getLogger(name).setLevel(logging.WARNING)
def _parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Batch Agent E2E evaluation harness",
prog="python -m eval",
)
sub = parser.add_subparsers(dest="command", required=True)
# ── run ───────────────────────────────────────────────────────
run_cmd = sub.add_parser("run", help="Run evaluations")
run_cmd.add_argument(
"--fixture", "-f",
help="Run only the named fixture (default: all)",
)
run_cmd.add_argument(
"--models", "-m",
default="github_copilot/gpt-5.3-codex",
help="Comma-separated list of models to test (default: github_copilot/gpt-5.3-codex)",
)
run_cmd.add_argument(
"--mode",
default=None,
choices=["step1", "step2", "full"],
help="Only run fixtures with this mode (default: all)",
)
run_cmd.add_argument(
"--no-judge",
action="store_true",
help="Skip LLM-as-judge scoring",
)
run_cmd.add_argument(
"--judge-model",
default="gpt-4o",
help="Model for LLM judge (default: gpt-4o)",
)
run_cmd.add_argument(
"--fixtures-dir",
default=None,
help="Path to fixtures directory (default: eval/fixtures/)",
)
run_cmd.add_argument("-v", "--verbose", action="store_true")
# ── list ──────────────────────────────────────────────────────
list_cmd = sub.add_parser("list", help="List available fixtures")
list_cmd.add_argument("--fixtures-dir", default=None)
list_cmd.add_argument("-v", "--verbose", action="store_true")
# ── sync ──────────────────────────────────────────────────────
sync_cmd = sub.add_parser("sync", help="Sync fixtures to Langfuse datasets")
sync_cmd.add_argument("--fixture", "-f", default=None, help="Sync only the named fixture")
sync_cmd.add_argument("--fixtures-dir", default=None)
sync_cmd.add_argument("-v", "--verbose", action="store_true")
# ── interactive ───────────────────────────────────────────────
inter_cmd = sub.add_parser("interactive", help="Interactive journey session (human-in-the-loop)")
inter_cmd.add_argument(
"--fixture", "-f",
help="Journey fixture to use (default: pick interactively)",
)
inter_cmd.add_argument(
"--model", "-m",
default="github_copilot/gpt-5.3-codex",
help="Model for the journey AI (default: github_copilot/gpt-5.3-codex)",
)
inter_cmd.add_argument(
"--judge-model",
default="gpt-4o",
help="Model for LLM judge (default: gpt-4o)",
)
inter_cmd.add_argument(
"--fixtures-dir",
default=None,
help="Path to fixtures directory (default: eval/fixtures/)",
)
inter_cmd.add_argument(
"--data-dir",
default=None,
help="Override sample data directory (e.g. path to private test files not in git)",
)
inter_cmd.add_argument("-v", "--verbose", action="store_true")
return parser.parse_args()
def _fixtures_dir(arg: str | None) -> Path | None:
if arg:
return Path(arg)
return None
async def _cmd_run(args: argparse.Namespace) -> None:
fixtures = discover_fixtures(_fixtures_dir(args.fixtures_dir))
if not fixtures:
print("No fixtures found. Create YAML files in eval/fixtures/.")
return
if args.fixture:
fixtures = [f for f in fixtures if f.name == args.fixture]
if not fixtures:
print(f"Fixture '{args.fixture}' not found.")
return
models = [m.strip() for m in args.models.split(",")]
all_results = []
for fixture in fixtures:
if args.mode and fixture.mode != args.mode:
continue
results = await run_fixture_eval(
fixture,
models=models,
use_llm_judge=not args.no_judge,
judge_model=args.judge_model,
)
all_results.extend(results)
print_results(all_results)
def _cmd_list(args: argparse.Namespace) -> None:
fixtures = discover_fixtures(_fixtures_dir(args.fixtures_dir))
journey_fixtures = discover_journey_fixtures(_fixtures_dir(args.fixtures_dir))
if not fixtures and not journey_fixtures:
print("No fixtures found.")
return
if fixtures:
print(f"\n{'[Agent Fixtures]'}")
print(f"{'Name':<30} {'Mode':<6} {'Types':<25} {'Expected'}")
print("-" * 90)
for f in fixtures:
types = ", ".join(f.data_types)
n_expected = len(f.expected) + len(f.expected_classification)
print(f"{f.name:<30} {f.mode:<6} {types:<25} {n_expected}")
if journey_fixtures:
print(f"\n{'[Journey Fixtures]'}")
print(f"{'Name':<30} {'Types':<25} {'Messages':<10} {'Criteria'}")
print("-" * 90)
for f in journey_fixtures:
types = ", ".join(f.data_types)
print(f"{f.name:<30} {types:<25} {len(f.user_messages):<10} {len(f.expected_template_criteria)}")
print()
def _cmd_sync(args: argparse.Namespace) -> None:
fixtures = discover_fixtures(_fixtures_dir(args.fixtures_dir))
journey_fixtures = discover_journey_fixtures(_fixtures_dir(args.fixtures_dir))
if args.fixture:
fixtures = [f for f in fixtures if f.name == args.fixture]
journey_fixtures = [f for f in journey_fixtures if f.name == args.fixture]
if not fixtures and not journey_fixtures:
print("No fixtures to sync.")
return
for fixture in fixtures:
name = langfuse_eval.sync_fixture_to_dataset(fixture)
if name:
print(f"Synced: {fixture.name}{name}")
else:
print(f"Skipped: {fixture.name} (Langfuse not configured)")
for fixture in journey_fixtures:
name = langfuse_eval.sync_journey_fixture_to_dataset(fixture)
if name:
print(f"Synced: {fixture.name}{name}")
else:
print(f"Skipped: {fixture.name} (Langfuse not configured)")
async def _cmd_interactive(args: argparse.Namespace) -> None:
journey_fixtures = discover_journey_fixtures(_fixtures_dir(args.fixtures_dir))
if not journey_fixtures:
print("No journey fixtures found. Create YAML files with type: journey in eval/fixtures/.")
return
if args.fixture:
fixtures = [f for f in journey_fixtures if f.name == args.fixture]
if not fixtures:
print(f"Journey fixture '{args.fixture}' not found.")
return
fixture = fixtures[0]
elif len(journey_fixtures) == 1:
fixture = journey_fixtures[0]
else:
# Let user pick
print("\nAvailable journey fixtures:")
for i, f in enumerate(journey_fixtures, 1):
print(f" {i}. {f.name}{f.description[:60]}")
print()
try:
choice = int(input("Pick a fixture number: ").strip()) - 1
fixture = journey_fixtures[choice]
except (ValueError, IndexError, EOFError, KeyboardInterrupt):
print("Invalid choice.")
return
await run_interactive(
fixture,
model=args.model,
judge_model=args.judge_model,
data_dir=Path(args.data_dir).resolve() if args.data_dir else None,
)
def main() -> None:
args = _parse_args()
_setup_logging(args.verbose)
if args.command == "run":
asyncio.run(_cmd_run(args))
elif args.command == "interactive":
asyncio.run(_cmd_interactive(args))
elif args.command == "list":
_cmd_list(args)
elif args.command == "sync":
_cmd_sync(args)
if __name__ == "__main__":
main()
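
A small edge case in the fixture picker of `_cmd_interactive`: entering `0` gives `choice = -1`, and `journey_fixtures[-1]` silently selects the last fixture instead of reporting an invalid choice. The sketch below shows one way to validate the 1-based input first. `pick_index` is a hypothetical helper and is not part of this commit.

```python
from __future__ import annotations


def pick_index(raw: str, count: int) -> int | None:
    """Convert a 1-based menu choice to a 0-based index, or None if invalid."""
    try:
        number = int(raw.strip())
    except ValueError:
        return None
    if 1 <= number <= count:
        return number - 1
    return None  # rejects 0 and negatives, which would otherwise index from the end


fixtures = ["journey-invoice-setup", "journey-notes", "journey-timelines"]
for raw in ("2", "0", "4", "abc"):
    index = pick_index(raw, len(fixtures))
    print(repr(raw), "->", fixtures[index] if index is not None else "Invalid choice.")
```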


@@ -0,0 +1,219 @@
"""Eval configuration — YAML fixture loader and dataclasses.
Fixtures come in two families:
1. **Agent fixtures** — test the batch agent pipeline.
Three modes controlled by ``mode``:
``step1`` — classification prompt only.
``step2`` — processing prompt only.
``full`` — both steps in sequence.
2. **Journey fixtures** — test the prompt-template builder conversation
(unchanged).
"""
from __future__ import annotations
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Literal
import yaml
logger = logging.getLogger(__name__)
EvalMode = Literal["step1", "step2", "full"]
@dataclass
class ExpectedRecord:
"""A single expected extraction result.
Only the fields specified are checked — unspecified fields are ignored.
"""
table: str # tasks | notes | timelines | projects
fields: dict[str, Any] # field_name → expected_value
@dataclass
class ExpectedClassification:
"""Expected output of step-1 classification for one file."""
file: str # relative path to the sample file
project_id: str # expected matched project id, or "new"
domains: list[str] # expected domain list
new_project_name: str | None = None


@dataclass
class EvalFixture:
    """A complete test scenario loaded from YAML.

    ``mode`` determines which pipeline steps are exercised:

    - **step1**: only ``_classify_file``
    - **step2**: only the processing LLM + tool loop
    - **full**: both steps in sequence (``run_local_agent``)
    """

    name: str
    description: str
    mode: EvalMode
    directory: str  # relative path to sample files
    data_types: list[str]
    file_extensions: list[str]
    models: list[str]  # if empty, use CLI default
    fixture_path: Path = field(default_factory=lambda: Path("."))

    # ── Step-1 inputs (classification) ───────────────────────────
    domain_definitions: str = ""
    projects_list: list[dict[str, Any]] = field(default_factory=list)

    # ── Step-2 inputs (processing) ───────────────────────────────
    existing_context: str = ""
    project_context: str = ""
    custom_prompt_section: str = ""

    # ── Seed records for mock executor ───────────────────────────
    seed_records: dict[str, list[dict]] = field(default_factory=dict)

    # ── Expected outputs ─────────────────────────────────────────
    expected_classification: list[ExpectedClassification] = field(default_factory=list)
    expected: list[ExpectedRecord] = field(default_factory=list)

    @property
    def fixture_dir(self) -> Path:
        """Absolute path to the sample files directory."""
        return self.fixture_path.parent / self.directory

    @classmethod
    def from_yaml(cls, path: Path) -> "EvalFixture":
        """Load a fixture from a YAML file."""
        raw = yaml.safe_load(path.read_text(encoding="utf-8"))
        mode: EvalMode = raw.get("mode", "full")

        # Parse expected records (step2/full)
        expected: list[ExpectedRecord] = []
        for table, records in (raw.get("expected") or {}).items():
            for rec in records:
                expected.append(ExpectedRecord(table=table, fields=rec))

        # Parse expected classification (step1/full)
        expected_classification: list[ExpectedClassification] = []
        for item in raw.get("expected_classification") or []:
            expected_classification.append(ExpectedClassification(
                file=item["file"],
                project_id=item["project_id"],
                domains=item.get("domains", []),
                new_project_name=item.get("new_project_name"),
            ))

        return cls(
            name=raw["name"],
            description=raw.get("description", ""),
            mode=mode,
            directory=raw.get("directory", "sample_files"),
            data_types=raw.get("data_types", ["tasks"]),
            file_extensions=raw.get("file_extensions", []),
            models=raw.get("models", []),
            fixture_path=path,
            # Step-1 inputs
            domain_definitions=raw.get("domain_definitions", ""),
            projects_list=raw.get("projects_list", []),
            # Step-2 inputs
            existing_context=raw.get("existing_context", ""),
            project_context=raw.get("project_context", ""),
            custom_prompt_section=raw.get("custom_prompt_section", ""),
            # Shared
            seed_records=raw.get("seed_records", {}),
            expected_classification=expected_classification,
            expected=expected,
        )


def discover_fixtures(fixtures_dir: Path | None = None) -> list[EvalFixture]:
    """Find and load all YAML fixtures in the fixtures directory."""
    if fixtures_dir is None:
        fixtures_dir = Path(__file__).parent / "fixtures"

    fixtures: list[EvalFixture] = []
    if not fixtures_dir.is_dir():
        logger.warning("eval: fixtures directory not found: %s", fixtures_dir)
        return fixtures

    for yaml_path in sorted(fixtures_dir.glob("*.yaml")):
        try:
            raw = yaml.safe_load(yaml_path.read_text(encoding="utf-8"))
            if raw.get("type") == "journey":
                continue  # Skip journey fixtures
            fixtures.append(EvalFixture.from_yaml(yaml_path))
            logger.info("eval: loaded fixture %s from %s", fixtures[-1].name, yaml_path.name)
        except Exception as exc:
            logger.error("eval: failed to load fixture %s: %s", yaml_path.name, exc)

    return fixtures


# ── Journey fixtures ─────────────────────────────────────────────────────


@dataclass
class JourneyFixture:
    """A journey test scenario — tests the prompt_template builder conversation."""

    name: str
    description: str
    directory: str  # relative path to sample files
    data_types: list[str]
    expected_template_criteria: list[str]  # what the template should contain/satisfy
    user_messages: list[str] = field(default_factory=list)  # for automated journey runs (unused in interactive mode)
    models: list[str] = field(default_factory=list)
    fixture_path: Path = field(default_factory=lambda: Path("."))

    @property
    def fixture_dir(self) -> Path:
        """Absolute path to the sample files directory."""
        return self.fixture_path.parent / self.directory

    @classmethod
    def from_yaml(cls, path: Path) -> "JourneyFixture":
        """Load a journey fixture from a YAML file."""
        raw = yaml.safe_load(path.read_text(encoding="utf-8"))
        return cls(
            name=raw["name"],
            description=raw.get("description", ""),
            directory=raw.get("directory", "sample_files"),
            data_types=raw.get("data_types", ["tasks"]),
            user_messages=raw.get("user_messages", []),
            expected_template_criteria=raw.get("expected_template_criteria", []),
            models=raw.get("models", []),
            fixture_path=path,
        )


def discover_journey_fixtures(fixtures_dir: Path | None = None) -> list[JourneyFixture]:
    """Find and load all journey YAML fixtures in the fixtures directory."""
    if fixtures_dir is None:
        fixtures_dir = Path(__file__).parent / "fixtures"

    fixtures: list[JourneyFixture] = []
    if not fixtures_dir.is_dir():
        logger.warning("eval: fixtures directory not found: %s", fixtures_dir)
        return fixtures

    for yaml_path in sorted(fixtures_dir.glob("*.yaml")):
        try:
            raw = yaml.safe_load(yaml_path.read_text(encoding="utf-8"))
            if raw.get("type") != "journey":
                continue
            fixtures.append(JourneyFixture.from_yaml(yaml_path))
            logger.info("eval: loaded journey fixture %s from %s", fixtures[-1].name, yaml_path.name)
        except Exception as exc:
            logger.error("eval: failed to load journey fixture %s: %s", yaml_path.name, exc)

    return fixtures
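
`EvalFixture.from_yaml` flattens the `expected` mapping (table name to list of field dicts) into one `ExpectedRecord` per entry, and takes `mode` from the YAML without checking it. The sketch below illustrates both steps on a plain dict standing in for the result of `yaml.safe_load`, so it runs without PyYAML. `parse_expected` and `validate_mode` are hypothetical names introduced here; the mode check is an optional addition, not something the loader currently does.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Literal, get_args

EvalMode = Literal["step1", "step2", "full"]


@dataclass
class ExpectedRecord:
    table: str
    fields: dict[str, Any]


def validate_mode(value: str) -> str:
    """Hypothetical guard: fail fast on a mode outside the EvalMode literal."""
    allowed = get_args(EvalMode)
    if value not in allowed:
        raise ValueError(f"unknown mode {value!r}; expected one of {allowed}")
    return value


def parse_expected(raw: dict[str, Any]) -> list[ExpectedRecord]:
    """Flatten {table: [fields, ...]} into one ExpectedRecord per entry."""
    return [
        ExpectedRecord(table=table, fields=rec)
        for table, records in (raw.get("expected") or {}).items()
        for rec in records
    ]


# Stand-in for the dict that yaml.safe_load() would return for a fixture file.
raw = {
    "mode": "step2",
    "expected": {
        "tasks": [{"title": "Sviluppo frontend React", "priority": "high"}],
        "notes": [{"title": "Meeting Kickoff Progetto E-Commerce"}],
    },
}
records = parse_expected(raw)
```

The `or {}` guard matters because a fixture with `expected:` left empty loads as `None`, which is the normal case for step1 fixtures.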


@@ -0,0 +1,40 @@
# Fixture: classify-invoices (step1)
# Tests _STEP1_SYSTEM_PROMPT — file classification and project matching.
# Verifies that the LLM correctly matches files to existing projects
# and identifies the right data domains.

name: classify-invoices
mode: step1
description: >
  Test file classification on Italian freelance invoices and meeting notes.
  Verifies project matching and domain identification.

directory: sample_files/invoices
data_types: [tasks, notes, timelines]
file_extensions: [txt, md]

# ── Step-1 prompt variables ──────────────────────────────────────
domain_definitions: |
  - tasks: Action items, deliverables, things to do — anything that someone needs to complete.
  - notes: Meeting summaries, decisions, reference information — permanent knowledge entries.
  - timelines: Project milestones, deadlines, scheduled events — specific dates that mark a point in the progress of a project.

projects_list:
  - id: "proj-web-redesign"
    name: "Redesign Sito Web Corporate"
    status: "active"
    aiSummary: "Corporate website redesign for Studio Architettura Bianchi"
  - id: "proj-ecommerce"
    name: "E-Commerce FashionStore"
    status: "active"
    aiSummary: "Next.js e-commerce platform for FashionStore srl"

# ── Expected classification results ─────────────────────────────
expected_classification:
  - file: "sample_files/invoices/fattura_042.txt"
    project_id: "proj-web-redesign"
    domains: [tasks, notes, timelines]
  - file: "sample_files/invoices/meeting_ecommerce.md"
    project_id: "proj-ecommerce"
    domains: [tasks, notes, timelines]
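
The scorer for step-1 results is not part of this diff; the commit message only says that classification precision, recall and F1 are posted. One plausible way to compute them, shown here purely as an assumption, is to compare sets of `(file, project_id, domain)` triples. `classification_prf` is a hypothetical function, and the predicted values are made up for illustration.

```python
from __future__ import annotations


def classification_prf(
    expected: list[dict], predicted: list[dict]
) -> tuple[float, float, float]:
    """Precision, recall and F1 over (file, project_id, domain) triples."""

    def triples(items: list[dict]) -> set[tuple[str, str, str]]:
        return {
            (item["file"], item["project_id"], domain)
            for item in items
            for domain in item.get("domains", [])
        }

    exp, pred = triples(expected), triples(predicted)
    hits = len(exp & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(exp) if exp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


expected = [
    {"file": "fattura_042.txt", "project_id": "proj-web-redesign",
     "domains": ["tasks", "notes", "timelines"]},
    {"file": "meeting_ecommerce.md", "project_id": "proj-ecommerce",
     "domains": ["tasks", "notes", "timelines"]},
]
# Hypothetical model output: the second file is missing the "timelines" domain.
predicted = [
    {"file": "fattura_042.txt", "project_id": "proj-web-redesign",
     "domains": ["tasks", "notes", "timelines"]},
    {"file": "meeting_ecommerce.md", "project_id": "proj-ecommerce",
     "domains": ["tasks", "notes"]},
]
precision, recall, f1 = classification_prf(expected, predicted)
```

With this definition a wrong `project_id` costs every domain of that file, which may or may not match how the real harness weighs project matching against domain detection.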


@@ -0,0 +1,108 @@
# Fixture: full-invoices (full)
# Tests both _STEP1_SYSTEM_PROMPT and _PROCESSING_SYSTEM_PROMPT in sequence
# via run_local_agent(). Verifies end-to-end classification + extraction.

name: full-invoices
mode: full
description: >
  End-to-end test: classify Italian invoices/meeting notes into the
  correct project, then extract tasks, notes, and timeline events.

directory: sample_files/invoices
data_types: [tasks, notes, timelines]
file_extensions: [txt, md]

# ── Step-1 prompt variables ──────────────────────────────────────
domain_definitions: |
  - tasks: Action items, deliverables, things to do — anything that someone needs to complete.
  - notes: Meeting summaries, decisions, reference information — permanent knowledge entries.
  - timelines: Project milestones, deadlines, scheduled events — specific dates that mark a point in the progress of a project.

projects_list:
  - id: "proj-web-redesign"
    name: "Redesign Sito Web Corporate"
    status: "active"
    aiSummary: "Corporate website redesign for Studio Architettura Bianchi"
  - id: "proj-ecommerce"
    name: "E-Commerce FashionStore"
    status: "active"
    aiSummary: "Next.js e-commerce platform for FashionStore srl"

# ── Step-2 prompt variables ──────────────────────────────────────
existing_context: |
  Existing tasks:
  (none)
  Existing notes:
  (none)
  Existing timelines:
  (none)

project_context: ""

custom_prompt_section: |
  User instructions:
  Estrai i dati dai file come segue:
  - TASK: ogni azione da fare, deliverable, o item con scadenza.
    Mappa "URGENTE" o "ALTA PRIORITÀ" → priority: high.
    Mappa "media priorità" → priority: medium.
    Mappa "bassa priorità" → priority: low.
    Se un item è marcato come "completato" o [x], impostalo status: done.
    Altrimenti status: todo.
  - NOTE: riassunti di meeting, decisioni prese, note tecniche.
  - TIMELINE: date di scadenza, milestone, meeting futuri.
  Imposta sempre isAiSuggested=1.

# ── Seed records (pre-existing DB state) ─────────────────────────
seed_records:
  projects:
    - id: "proj-web-redesign"
      name: "Redesign Sito Web Corporate"
      status: "active"
      aiSummary: "Corporate website redesign for Studio Architettura Bianchi"
    - id: "proj-ecommerce"
      name: "E-Commerce FashionStore"
      status: "active"
      aiSummary: "Next.js e-commerce platform for FashionStore srl"
  tasks: []
  notes: []
  timelines: []

# ── Expected classification (step 1) ─────────────────────────────
expected_classification:
  - file: "sample_files/invoices/fattura_042.txt"
    project_id: "proj-web-redesign"
    domains: [tasks, notes, timelines]
  - file: "sample_files/invoices/meeting_ecommerce.md"
    project_id: "proj-ecommerce"
    domains: [tasks, notes, timelines]

# ── Expected extractions (step 2) ────────────────────────────────
expected:
  tasks:
    - title: "Sviluppo frontend React"
      priority: "high"
      status: "todo"
    - title: "Integrazione API backend"
      priority: "medium"
      status: "todo"
    - title: "Testing cross-browser e fix bug responsive"
      status: "todo"
    - title: "Preparare wireframe homepage"
      priority: "high"
      status: "todo"
    - title: "Setup progetto Next.js e configurare CI/CD"
      priority: "medium"
      status: "todo"
    - title: "Ricerca plugin Stripe per gestione abbonamenti"
      priority: "low"
      status: "todo"
  notes:
    - title: "Meeting Kickoff Progetto E-Commerce"
  timelines:
    - title: "MVP E-Commerce pronto"
    - title: "Meeting di revisione"
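
The `ExpectedRecord` docstring in the eval config states that only the fields listed in a fixture are checked and everything else is ignored. A minimal sketch of that matching rule follows, assuming exact equality per field. `record_matches` and `unmatched` are hypothetical helpers; the real scorer is not shown in this diff and may use fuzzier title matching.

```python
from __future__ import annotations

from typing import Any


def record_matches(expected_fields: dict[str, Any], actual: dict[str, Any]) -> bool:
    """True when every expected field is present in `actual` with an equal value."""
    return all(actual.get(name) == value for name, value in expected_fields.items())


def unmatched(
    expected: list[dict[str, Any]], actual: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Expected records that no actual record satisfies."""
    return [exp for exp in expected if not any(record_matches(exp, act) for act in actual)]


expected_tasks = [
    {"title": "Sviluppo frontend React", "priority": "high", "status": "todo"},
    {"title": "Testing cross-browser e fix bug responsive", "status": "todo"},
]
# Hypothetical agent output; extra fields such as isAiSuggested are ignored.
actual_tasks = [
    {"title": "Sviluppo frontend React", "priority": "high", "status": "todo", "isAiSuggested": 1},
    {"title": "Testing cross-browser e fix bug responsive", "priority": "medium", "status": "done"},
]
missing = unmatched(expected_tasks, actual_tasks)
```

Leaving `priority` out of an expected record, as the fixture does for the cross-browser testing task, means any priority the agent chooses is accepted.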


@@ -0,0 +1,28 @@
# Journey Fixture: journey-invoice-setup
# Used by `python -m eval interactive` for human-in-the-loop testing
# of the journey chatbot's prompt-building conversation.

type: journey
name: journey-invoice-setup
description: >
  Interactive test for the journey chatbot — explore a directory of
  Italian invoices and meeting notes, answer the chatbot's questions,
  and verify it produces a well-structured prompt_template for data
  extraction.

directory: sample_files/invoices
data_types: [tasks, notes, timelines, projects]

# Criteria the generated prompt_template must satisfy
# Each is scored 0-1 by an LLM judge
expected_template_criteria:
  - "Mentions creating tasks from action items and work descriptions"
  - "Mentions creating notes from meeting summaries"
  - "Mentions extracting timeline events from deadlines and meeting dates"
  - "Mentions creating projects from relevant information"
  - "Sets isAiSuggested=1 on all created records"
  - "Does NOT include projectId assignment logic"
  - "Uses camelCase field names (title, status, priority, dueDate, content)"

# Models to test (empty = use CLI --models default)
models: []
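
Each criterion above receives a score between 0 and 1 from the judge. The sketch below shows how such scores can be averaged and labelled, using the 0.7 and 0.4 thresholds that the interactive CLI's `_print_score` helper applies. `band` and `criteria_average` are hypothetical names, and the scores are invented for illustration.

```python
from __future__ import annotations


def band(score: float) -> str:
    """Map a 0-1 judge score to the label printed by the interactive CLI."""
    if score >= 0.7:
        return "PASS"
    if score >= 0.4:
        return "PARTIAL"
    return "FAIL"


def criteria_average(scores: dict[str, float]) -> float:
    """Mean of the per-criterion scores, or 0.0 when there are none."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical judge output for three of the criteria listed above.
scores = {
    "Mentions creating notes from meeting summaries": 1.0,
    "Sets isAiSuggested=1 on all created records": 0.5,
    "Does NOT include projectId assignment logic": 0.0,
}
labels = {criterion: band(score) for criterion, score in scores.items()}
average = criteria_average(scores)
```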


@@ -0,0 +1,81 @@
# Fixture: process-invoices (step2)
# Tests _PROCESSING_SYSTEM_PROMPT — data extraction & tool calling.
# The classification step is skipped; prompt variables are injected directly.

name: process-invoices
mode: step2
description: >
  Test data extraction from Italian freelance invoices.
  Verifies correct record creation via tool calls with the right
  fields, priorities, and status values.

directory: sample_files/invoices
data_types: [tasks, notes, timelines]
file_extensions: [txt, md]

# ── Step-2 prompt variables ──────────────────────────────────────
existing_context: |
  Existing tasks:
  (none)
  Existing notes:
  (none)
  Existing timelines:
  (none)

project_context: >
  Project: Redesign Sito Web Corporate (id: proj-web-redesign).
  Always set projectId to this id on every record you create.

custom_prompt_section: |
  User instructions:
  Estrai i dati dai file come segue:
  - TASK: ogni azione da fare, deliverable, o item con scadenza.
    Mappa "URGENTE" o "ALTA PRIORITÀ" → priority: high.
    Mappa "media priorità" → priority: medium.
    Mappa "bassa priorità" → priority: low.
    Se un item è marcato come "completato" o [x], impostalo status: done.
    Altrimenti status: todo.
  - NOTE: riassunti di meeting, decisioni prese, note tecniche.
    Il titolo deve essere descrittivo. Il content deve includere tutti i dettagli.
  - TIMELINE: date di scadenza, milestone, meeting futuri.
  Imposta sempre isAiSuggested=1.

# ── Seed records (pre-existing DB state) ─────────────────────────
seed_records:
  projects:
    - id: "proj-web-redesign"
      name: "Redesign Sito Web Corporate"
      status: "active"
  tasks: []
  notes: []
  timelines: []

# ── Expected extractions ─────────────────────────────────────────
expected:
  tasks:
    - title: "Sviluppo frontend React"
      priority: "high"
      status: "todo"
    - title: "Integrazione API backend"
      priority: "medium"
      status: "todo"
    - title: "Testing cross-browser e fix bug responsive"
      status: "todo"
    - title: "Preparare wireframe homepage"
      priority: "high"
      status: "todo"
    - title: "Setup progetto Next.js e configurare CI/CD"
      priority: "medium"
      status: "todo"
    - title: "Ricerca plugin Stripe per gestione abbonamenti"
      priority: "low"
      status: "todo"
  notes:
    - title: "Meeting Kickoff Progetto E-Commerce"
  timelines:
    - title: "MVP E-Commerce pronto"
    - title: "Meeting di revisione"
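
The `custom_prompt_section` above asks the model to apply a small keyword mapping for priority and status. A deterministic version of the same rules can help when deriving the `expected` values by hand. The sketch below is only an illustration: `map_priority` and `map_status` are hypothetical, and accepting the reversed word order ("priorità media") is an assumption added because the sample invoice writes it that way.

```python
from __future__ import annotations

from typing import Optional


def map_priority(text: str) -> Optional[str]:
    """Return high/medium/low from Italian priority keywords, or None if absent."""
    lowered = text.casefold()
    if "urgente" in lowered or "alta priorità" in lowered:
        return "high"
    if "media priorità" in lowered or "priorità media" in lowered:
        return "medium"
    if "bassa priorità" in lowered or "priorità bassa" in lowered:
        return "low"
    return None  # no keyword: the fixture leaves priority unchecked


def map_status(text: str) -> str:
    """Return done for items marked completato or [x], otherwise todo."""
    lowered = text.casefold()
    return "done" if "completato" in lowered or "[x]" in lowered else "todo"


invoice_line = "Sviluppo frontend React (40 ore), URGENTE, completare entro 20 marzo"
meeting_line = "- [x] Giulia: inviare contratto firmato al cliente, COMPLETATO"
```

Note that "completare" in the first line does not trigger `done`, because the check looks for the participle "completato" only.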


@@ -0,0 +1,18 @@
FATTURA N. 2026-0042
Data: 15 Marzo 2026
Cliente: Studio Architettura Bianchi
Progetto: Redesign Sito Web Corporate
Descrizione lavori:
- Sviluppo frontend React (40 ore) — URGENTE, completare entro 20 marzo
- Integrazione API backend (20 ore) — priorità media
- Design UI/UX mockup homepage (8 ore) — completato
- Testing cross-browser e fix bug responsive (12 ore) — da iniziare
Totale: €4.800,00 + IVA
Note:
Meeting di revisione previsto per il 18 marzo alle 10:00.
Il cliente ha richiesto modifiche al layout mobile della sezione contatti.
Attendere conferma budget aggiuntivo per sezione blog.


@@ -0,0 +1,25 @@
# Meeting Notes - Kickoff Progetto E-Commerce
**Data:** 10 Marzo 2026
**Partecipanti:** Marco R., Giulia T., Cliente (FashionStore srl)
## Decisioni prese
1. **Piattaforma**: Next.js + Stripe per i pagamenti
2. **Timeline**: MVP pronto entro 30 aprile 2026
3. **Budget**: €12.000 totale, €4.000 anticipo già ricevuto
## Action items
- [ ] Marco: preparare wireframe homepage entro 14 marzo — ALTA PRIORITÀ
- [ ] Giulia: setup progetto Next.js e configurare CI/CD — media priorità
- [ ] Marco: ricerca plugin Stripe per gestione abbonamenti — bassa priorità
- [x] Giulia: inviare contratto firmato al cliente — COMPLETATO
## Note aggiuntive
Il cliente vuole un design minimalista, ispirato a Zara.com.
Colori primari: nero, bianco, oro.
Font: Inter per body, Playfair Display per headings.
Prossimo meeting: 24 marzo 2026 ore 15:00.


@@ -0,0 +1,471 @@
"""Interactive journey session — human-in-the-loop CLI conversation.

Flow:
  1. Show the system prompt used by the journey AI.
  2. Start the journey (AI explores files, asks first question).
  3. User types responses in the terminal — AI replies.
  4. User types `/done` to end the conversation.
  5. User writes a comment about the interaction quality.
  6. LLM judge scores the conversation + generated template.
  7. Results are reported to Langfuse.

Usage::

    python -m eval interactive                          # pick a fixture interactively
    python -m eval interactive --fixture=journey-invoice-setup
    python -m eval interactive --model=gpt-4o
    python -m eval interactive --judge-model=github_copilot/gpt-4o-mini
"""
from __future__ import annotations
import asyncio
import json
import logging
import sys
import time
import uuid
from dataclasses import dataclass, field
from pathlib import Path # used by the data_dir annotation on run_interactive
from typing import Any
from langchain_core.messages import HumanMessage, SystemMessage
from eval.config import JourneyFixture, discover_journey_fixtures
from eval.mock_executor import MockExecutor
from eval import langfuse_eval
logger = logging.getLogger(__name__)
# ── Special commands ─────────────────────────────────────────────────────
_CMD_DONE = "/done"
_CMD_QUIT = "/quit"
_CMD_TEMPLATE = "/template"
_CMD_HELP = "/help"
_HELP_TEXT = f"""\
{_CMD_DONE} — End the conversation and proceed to evaluation
{_CMD_QUIT} — Abort without evaluation
{_CMD_TEMPLATE} — Show the generated template (if any)
{_CMD_HELP} — Show this help"""
# ── Terminal colours (ANSI) ──────────────────────────────────────────────
_C_RESET = "\033[0m"
_C_BOLD = "\033[1m"
_C_DIM = "\033[2m"
_C_CYAN = "\033[36m"
_C_GREEN = "\033[32m"
_C_YELLOW = "\033[33m"
_C_MAGENTA = "\033[35m"
_C_RED = "\033[31m"
_C_BLUE = "\033[34m"
def _print_header(text: str) -> None:
print(f"\n{_C_BOLD}{_C_CYAN}{'═' * 80}")
print(f" {text}")
print(f"{'═' * 80}{_C_RESET}\n")
def _print_ai(text: str) -> None:
print(f"\n{_C_GREEN}{_C_BOLD}AI:{_C_RESET} {text}\n")
def _print_system(text: str) -> None:
print(f"{_C_DIM}{text}{_C_RESET}")
def _print_score(label: str, score: float) -> None:
if score >= 0.7:
color = _C_GREEN
tag = "PASS"
elif score >= 0.4:
color = _C_YELLOW
tag = "PARTIAL"
else:
color = _C_RED
tag = "FAIL"
print(f" {color}{tag:>7}{_C_RESET} ({score:.1f}) {label}")
# ── Result type ──────────────────────────────────────────────────────────
@dataclass
class InteractiveResult:
fixture_name: str
model: str
judge_model: str
prompt_template: str | None
conversation: list[dict[str, str]]
user_comment: str
done: bool
criteria_scores: dict[str, float]
overall_score: float
judge_reasoning: str
elapsed_seconds: float
def summary(self) -> dict[str, Any]:
return {
"fixture": self.fixture_name,
"model": self.model,
"judge_model": self.judge_model,
"done": self.done,
"turns": len([c for c in self.conversation if c["role"] == "user"]),
"overall_score": round(self.overall_score, 3),
"user_comment": self.user_comment,
"criteria_scores": {k: round(v, 3) for k, v in self.criteria_scores.items()},
"elapsed_s": round(self.elapsed_seconds, 1),
}
# ── LLM judge ────────────────────────────────────────────────────────────
_INTERACTIVE_JUDGE_SYSTEM = """\
You are an evaluation judge for AI-generated prompt templates produced during
an interactive conversation between a human and a journey chatbot.
The chatbot explored a directory and through multi-turn conversation with the
user produced a prompt_template — an instruction set for a data-extraction agent.
You have access to:
- The full conversation transcript
- The generated prompt_template (if any)
- The user's own comment about the interaction
- A list of quality criteria
Score each criterion from 0 to 1:
- 1.0: Fully satisfied
- 0.5: Partially satisfied
- 0.0: Not satisfied
Also provide an overall_quality score (0-1) evaluating the conversation flow,
how well the AI understood the user, and the template quality.
Respond with ONLY a JSON object:
{
"criteria_scores": {"criterion_1": 0.8, ...},
"overall_quality": 0.85,
"reasoning": "Brief explanation covering both conversation quality and template accuracy"
}
"""
async def _judge_interactive(
conversation: list[dict[str, str]],
prompt_template: str | None,
user_comment: str,
criteria: list[str],
*,
judge_model: str = "gpt-4o-mini",
) -> tuple[dict[str, float], float, str]:
"""Score an interactive session. Returns (criteria_scores, overall_quality, reasoning)."""
from shared.llm import get_llm
llm = get_llm(model=judge_model, temperature=0)
conv_text = "\n".join(
f"{'USER' if t['role'] == 'user' else 'AI'}: {t['content']}"
for t in conversation
)
criteria_text = "\n".join(f" {i+1}. {c}" for i, c in enumerate(criteria))
user_content = (
f"## Conversation transcript\n```\n{conv_text}\n```\n\n"
f"## Generated prompt_template\n```\n{prompt_template or '(none — conversation did not complete)'}\n```\n\n"
f"## User's comment\n{user_comment}\n\n"
f"## Criteria to evaluate\n{criteria_text}"
)
try:
response = await llm.ainvoke([
SystemMessage(content=_INTERACTIVE_JUDGE_SYSTEM),
HumanMessage(content=user_content),
])
raw = response.content.strip()
if raw.startswith("```"):
raw = raw.split("```")[1]
if raw.startswith("json"):
raw = raw[4:]
parsed = json.loads(raw.strip())
scores_raw = parsed.get("criteria_scores", parsed.get("scores", {}))
criteria_scores: dict[str, float] = {}
for i, criterion in enumerate(criteria):
key_candidates = [f"criterion_{i+1}", criterion, criterion[:50], str(i + 1)]
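score: float | None = None
for key in key_candidates:
if key in scores_raw:
score = float(scores_raw[key])
break
# Fall back to positional order only when no key matched, so a genuine 0.0 is kept.
if score is None and i < len(scores_raw):
score = float(list(scores_raw.values())[i])
criteria_scores[criterion] = 0.0 if score is None else score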
overall = float(parsed.get("overall_quality", 0.0))
reasoning = str(parsed.get("reasoning", ""))
return criteria_scores, overall, reasoning
except Exception as exc:
logger.warning("interactive judge failed: %s", exc)
return {c: 0.0 for c in criteria}, 0.0, f"Judge error: {exc}"
# ── Interactive session ──────────────────────────────────────────────────
async def run_interactive(
fixture: JourneyFixture,
*,
model: str = "gpt-4o",
judge_model: str = "gpt-4o-mini",
data_dir: Path | None = None,
) -> InteractiveResult:
"""Run an interactive journey session in the terminal.
Parameters
----------
data_dir :
If set, overrides the fixture's sample-file directory. The LLM
will explore this folder instead of the default
``fixtures/sample_files/…``. Useful for private test data that
shouldn't be committed to git.
"""
from shared.config import settings
from shared.ws_context import set_current_user, clear_current_user
from app.journey import (
handle_journey_start,
handle_journey_message,
_build_system_prompt,
)
# When --data-dir is given, the MockExecutor's root becomes
# data_dir's parent and the journey directory is data_dir's name.
# This way the LLM sees a meaningful directory name (not ".") and
# MockExecutor resolves paths correctly.
# Otherwise, use the fixture's YAML parent and its relative path.
if data_dir:
mock_root = data_dir.parent
journey_directory = data_dir.name
else:
mock_root = fixture.fixture_path.parent
journey_directory = fixture.directory
mock = MockExecutor(
fixture_dir=mock_root,
seed_records={},
)
original_model = settings.LLM_MODEL
settings.LLM_MODEL = model
eval_user_id = f"interactive-{uuid.uuid4().hex[:8]}"
# ── Show system prompt ───────────────────────────────────────
system_prompt = _build_system_prompt(journey_directory, fixture.data_types)
_print_header("SYSTEM PROMPT")
print(f"{_C_DIM}{system_prompt}{_C_RESET}")
_print_header(f"INTERACTIVE JOURNEY | fixture: {fixture.name} | model: {model}")
print(f" Data dir: {mock_root / journey_directory}")
print(f" Type your responses. Commands: {_CMD_DONE}, {_CMD_QUIT}, {_CMD_TEMPLATE}, {_CMD_HELP}")
print(f" Judge model: {judge_model}")
print(f" Criteria: {len(fixture.expected_template_criteria)}")
print()
conversation: list[dict[str, str]] = []
prompt_template: str | None = None
done = False
start_time = time.time()
try:
set_current_user(eval_user_id)
with mock.patch():
# ── Start ────────────────────────────────────────────
_print_system("Starting journey... (AI is exploring your files)")
start_frame: dict[str, Any] = {
"agent_type": "local",
"directory": journey_directory,
"data_types": fixture.data_types,
"session_id": f"interactive-{uuid.uuid4().hex[:8]}",
}
reply = await handle_journey_start(eval_user_id, start_frame)
session_id = reply["session_id"]
conversation.append({"role": "assistant", "content": reply["message"]})
_print_ai(reply["message"])
if reply["done"]:
prompt_template = reply.get("prompt_template")
done = True
_print_system("Journey completed on first reply (template generated).")
# ── Conversation loop ────────────────────────────────
while not done:
try:
user_input = input(f"{_C_BOLD}{_C_BLUE}YOU:{_C_RESET} ").strip()
except (EOFError, KeyboardInterrupt):
print()
user_input = _CMD_QUIT
if not user_input:
continue
# Handle commands
if user_input.lower() == _CMD_QUIT:
_print_system("Aborted — no evaluation will be performed.")
settings.LLM_MODEL = original_model
clear_current_user()
return InteractiveResult(
fixture_name=fixture.name, model=model, judge_model=judge_model,
prompt_template=None, conversation=conversation,
user_comment="(aborted)", done=False,
criteria_scores={}, overall_score=0.0,
judge_reasoning="Session aborted by user.",
elapsed_seconds=time.time() - start_time,
)
if user_input.lower() == _CMD_HELP:
print(_HELP_TEXT)
continue
if user_input.lower() == _CMD_TEMPLATE:
if prompt_template:
print(f"\n{_C_MAGENTA}{prompt_template}{_C_RESET}\n")
else:
_print_system("No template generated yet.")
continue
if user_input.lower() == _CMD_DONE:
_print_system("Ending conversation...")
break
# ── Send message to AI ───────────────────────────
conversation.append({"role": "user", "content": user_input})
_print_system("AI is thinking...")
msg_frame: dict[str, Any] = {
"session_id": session_id,
"message": user_input,
}
reply = await handle_journey_message(eval_user_id, msg_frame)
conversation.append({"role": "assistant", "content": reply["message"]})
_print_ai(reply["message"])
if reply["done"]:
prompt_template = reply.get("prompt_template")
done = True
_print_system("Journey completed — template generated!")
except Exception as exc:
logger.error("interactive journey failed: %s", exc)
_print_system(f"Error: {exc}")
finally:
settings.LLM_MODEL = original_model
clear_current_user()
elapsed = time.time() - start_time
turns = len([c for c in conversation if c["role"] == "user"])
# ── Show template if generated ───────────────────────────────
if prompt_template:
_print_header("GENERATED TEMPLATE")
print(f"{_C_MAGENTA}{prompt_template}{_C_RESET}\n")
else:
_print_system("No template was generated during this session.")
# ── User comment ─────────────────────────────────────────────
_print_header("YOUR EVALUATION")
print(" Write your comment about this interaction (press Enter twice to finish):")
print()
comment_lines: list[str] = []
try:
while True:
line = input()
if line == "" and comment_lines and comment_lines[-1] == "":
comment_lines.pop() # remove trailing empty
break
comment_lines.append(line)
except (EOFError, KeyboardInterrupt):
pass
user_comment = "\n".join(comment_lines).strip() or "(no comment)"
# ── Judge ────────────────────────────────────────────────────
_print_header("LLM JUDGE EVALUATION")
_print_system(f"Scoring with {judge_model}...")
criteria_scores, overall_quality, judge_reasoning = await _judge_interactive(
conversation=conversation,
prompt_template=prompt_template,
user_comment=user_comment,
criteria=fixture.expected_template_criteria,
judge_model=judge_model,
)
# ── Display scores ───────────────────────────────────────────
print()
for criterion, score in criteria_scores.items():
_print_score(criterion, score)
overall = (
sum(criteria_scores.values()) / len(criteria_scores)
if criteria_scores
else 0.0
)
print(f"\n {_C_BOLD}Criteria avg: {overall:.2f}{_C_RESET}")
print(f" {_C_BOLD}Overall quality: {overall_quality:.2f}{_C_RESET}")
print(f" {_C_BOLD}Turns: {turns}{_C_RESET}")
print(f" {_C_BOLD}Time: {elapsed:.1f}s{_C_RESET}")
print(f"\n {_C_DIM}Judge: {judge_reasoning}{_C_RESET}")
print(f" {_C_DIM}Your comment: {user_comment}{_C_RESET}\n")
result = InteractiveResult(
fixture_name=fixture.name,
model=model,
judge_model=judge_model,
prompt_template=prompt_template,
conversation=conversation,
user_comment=user_comment,
done=done,
criteria_scores=criteria_scores,
overall_score=overall_quality,
judge_reasoning=judge_reasoning,
elapsed_seconds=elapsed,
)
# ── Report to Langfuse ───────────────────────────────────────
trace_id = langfuse_eval.log_eval_trace(
fixture_name=fixture.name,
model=model,
prompt_variant="interactive",
prompt_template=prompt_template or "(not generated)",
actual_mutations=[{
"conversation": conversation[:30],
"user_comment": user_comment,
}],
scores_summary=result.summary(),
langfuse_prompt_names=["journey_system"],
)
if trace_id:
from eval.scorer import EvalScores
scores_obj = EvalScores(
fixture_name=fixture.name,
model=model,
prompt_variant="interactive",
precision=overall,
recall=float(done),
f1=overall,
llm_judge_score=overall_quality,
llm_judge_reasoning=judge_reasoning,
)
langfuse_eval.post_eval_scores(scores_obj, trace_id=trace_id)
_print_system(f"Results reported to Langfuse (trace: {trace_id})")
else:
_print_system("Langfuse not configured — results not reported.")
return result
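
The judge reply is parsed in two steps: an optional Markdown code fence is stripped before `json.loads`, and the returned score keys are then aligned with the fixture's criteria. The sketch below isolates both steps so they can be tested without calling an LLM. `strip_code_fence` and `map_scores` are hypothetical names, and the reply string is invented. Unlike a plain `score == 0.0` check, the mapping falls back to positional order only when no key matched, so a legitimate score of zero is kept.

```python
from __future__ import annotations

import json

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example


def strip_code_fence(raw: str) -> str:
    """Remove a surrounding Markdown code fence (optionally tagged json)."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split(FENCE)[1]
        if text.startswith("json"):
            text = text[4:]
    return text.strip()


def map_scores(scores_raw: dict[str, float], criteria: list[str]) -> dict[str, float]:
    """Align judge scores with criteria by key, then by position as a fallback."""
    values = list(scores_raw.values())
    result: dict[str, float] = {}
    for i, criterion in enumerate(criteria):
        candidates = [f"criterion_{i + 1}", criterion, criterion[:50], str(i + 1)]
        key = next((k for k in candidates if k in scores_raw), None)
        if key is not None:
            result[criterion] = float(scores_raw[key])
        elif i < len(values):
            result[criterion] = float(values[i])  # positional fallback
        else:
            result[criterion] = 0.0
    return result


# Invented judge reply: fenced JSON with one recognised key and one unknown key.
reply = (
    FENCE + "json\n"
    '{"criteria_scores": {"criterion_1": 0.0, "other": 1.0}, "overall_quality": 0.5}\n'
    + FENCE
)
parsed = json.loads(strip_code_fence(reply))
criteria = ["Mentions creating notes", "Uses camelCase field names"]
scores = map_scores(parsed["criteria_scores"], criteria)
```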


@@ -0,0 +1,385 @@
"""Journey eval runner — tests the prompt_template builder conversation.
For each (journey_fixture × model) combination:
1. Build a MockExecutor (for filesystem tools used during journey)
2. Patch execute_on_client
3. Override LLM_MODEL
4. Call handle_journey_start to kick off the conversation
5. Feed simulated user_messages via handle_journey_message
6. Collect the generated prompt_template
7. Score it against expected_template_criteria (via LLM judge)
8. Report to Langfuse
"""
from __future__ import annotations
import asyncio
import copy
import json
import logging
import time
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from langchain_core.messages import HumanMessage, SystemMessage
from eval.config import JourneyFixture
from eval.mock_executor import MockExecutor
from eval import langfuse_eval
logger = logging.getLogger(__name__)
# ── Result type ──────────────────────────────────────────────────────────
@dataclass
class JourneyEvalResult:
"""Result of one journey eval run."""
fixture_name: str
model: str
prompt_template: str | None # the generated template (None if journey failed)
conversation_turns: int
done: bool # whether journey reached completion
criteria_scores: dict[str, float] # criterion → 0-1 score
overall_score: float # average of criteria scores
judge_reasoning: str
elapsed_seconds: float
def summary(self) -> dict[str, Any]:
return {
"fixture": self.fixture_name,
"model": self.model,
"done": self.done,
"turns": self.conversation_turns,
"overall_score": round(self.overall_score, 3),
"criteria_scores": {k: round(v, 3) for k, v in self.criteria_scores.items()},
"elapsed_s": round(self.elapsed_seconds, 1),
}
# ── LLM judge for template quality ──────────────────────────────────────
_JOURNEY_JUDGE_SYSTEM = """\
You are an evaluation judge for AI-generated prompt templates.
A journey chatbot explored a user's directory structure and through
conversation produced a prompt_template — an instruction set for a
data-extraction agent.
Your task: evaluate the generated template against a list of criteria.
Score each criterion from 0 to 1:
- 1.0: Fully satisfied, clearly present in the template
- 0.5: Partially satisfied or ambiguously addressed
- 0.0: Not satisfied, missing from the template
Respond with ONLY a JSON object:
{
"scores": {"criterion_1": 0.8, "criterion_2": 1.0, ...},
"reasoning": "Brief explanation"
}
"""
async def _judge_template(
    prompt_template: str,
    criteria: list[str],
    *,
    judge_model: str = "gpt-4o-mini",
) -> tuple[dict[str, float], str]:
    """Use an LLM to evaluate a generated prompt_template against criteria.

    Returns (criteria_scores, reasoning).
    """
    from shared.llm import get_llm

    llm = get_llm(model=judge_model, temperature=0)
    criteria_text = "\n".join(f" {i+1}. {c}" for i, c in enumerate(criteria))
    user_content = (
        f"## Generated prompt_template\n```\n{prompt_template}\n```\n\n"
        f"## Criteria to evaluate\n{criteria_text}"
    )
    try:
        response = await llm.ainvoke([
            SystemMessage(content=_JOURNEY_JUDGE_SYSTEM),
            HumanMessage(content=user_content),
        ])
        raw = response.content.strip()
        # Strip a Markdown code fence (optionally tagged "json") around the reply
        if raw.startswith("```"):
            raw = raw.split("```")[1]
            if raw.startswith("json"):
                raw = raw[4:]
        parsed = json.loads(raw.strip())
        scores_raw = parsed.get("scores", {})
        positional = list(scores_raw.values())

        # Map criterion keys back to the original criteria text
        criteria_scores: dict[str, float] = {}
        for i, criterion in enumerate(criteria):
            # Try matching by index key or exact criterion text
            key_candidates = [
                f"criterion_{i+1}",
                criterion,
                criterion[:50],
                str(i + 1),
            ]
            # None means "no key matched"; a matched score of 0.0 must be kept
            score: float | None = None
            for key in key_candidates:
                if key in scores_raw:
                    score = float(scores_raw[key])
                    break
            # If no match found, try values in order
            if score is None and i < len(positional):
                score = float(positional[i])
            criteria_scores[criterion] = 0.0 if score is None else score

        reasoning = str(parsed.get("reasoning", ""))
        return criteria_scores, reasoning
    except Exception as exc:
        logger.warning("journey_eval: LLM judge failed: %s", exc)
        return {c: 0.0 for c in criteria}, f"Judge error: {exc}"
# ── Journey runner ───────────────────────────────────────────────────────
async def run_single_journey_eval(
fixture: JourneyFixture,
model: str,
*,
judge_model: str = "gpt-4o-mini",
data_dir: Path | None = None,
) -> JourneyEvalResult:
"""Execute one journey eval: start → messages → score template."""
from shared.config import settings
# When data_dir is given, use its parent as MockExecutor root
# and its name as the journey directory so the LLM sees a
# meaningful path (not ".").
if data_dir:
mock_root = data_dir.parent
journey_directory = data_dir.name
else:
mock_root = fixture.fixture_path.parent
journey_directory = fixture.directory
mock = MockExecutor(
fixture_dir=mock_root,
seed_records={},
)
original_model = settings.LLM_MODEL
settings.LLM_MODEL = model
eval_user_id = f"eval-journey-{uuid.uuid4().hex[:8]}"
logger.info(
"journey_eval: starting %s | model=%s",
fixture.name, model,
)
start_time = time.time()
prompt_template: str | None = None
conversation: list[dict[str, str]] = []
done = False
try:
from shared.ws_context import set_current_user, clear_current_user
from app.journey import handle_journey_start, handle_journey_message, _sessions
set_current_user(eval_user_id)
with mock.patch():
# ── Start the journey ────────────────────────────────
start_frame: dict[str, Any] = {
"agent_type": "local",
"directory": journey_directory,
"data_types": fixture.data_types,
"session_id": f"eval-{uuid.uuid4().hex[:8]}",
}
reply = await handle_journey_start(eval_user_id, start_frame)
session_id = reply["session_id"]
conversation.append({"role": "assistant", "content": reply["message"]})
logger.info(
"journey_eval: start reply (%d chars), done=%s",
len(reply["message"]), reply["done"],
)
if reply["done"]:
prompt_template = reply.get("prompt_template")
done = True
else:
# ── Send user messages ───────────────────────────
for i, user_msg in enumerate(fixture.user_messages):
if done:
break
conversation.append({"role": "user", "content": user_msg})
msg_frame: dict[str, Any] = {
"session_id": session_id,
"message": user_msg,
}
reply = await handle_journey_message(eval_user_id, msg_frame)
conversation.append({"role": "assistant", "content": reply["message"]})
logger.info(
"journey_eval: turn %d reply (%d chars), done=%s",
i + 1, len(reply["message"]), reply["done"],
)
if reply["done"]:
prompt_template = reply.get("prompt_template")
done = True
# If not done after all user messages, send a final nudge
if not done:
nudge = "Please generate the final prompt_template now. I'm satisfied with the configuration."
conversation.append({"role": "user", "content": nudge})
nudge_frame: dict[str, Any] = {
"session_id": session_id,
"message": nudge,
}
reply = await handle_journey_message(eval_user_id, nudge_frame)
conversation.append({"role": "assistant", "content": reply["message"]})
if reply["done"]:
prompt_template = reply.get("prompt_template")
done = True
except Exception as exc:
logger.error("journey_eval: pipeline failed for %s/%s: %s", fixture.name, model, exc)
finally:
settings.LLM_MODEL = original_model
from shared.ws_context import clear_current_user
clear_current_user()
elapsed = time.time() - start_time
turns = len([c for c in conversation if c["role"] == "user"])
logger.info(
"journey_eval: completed in %.1fs — %d turns, done=%s, template=%s",
elapsed, turns, done, "yes" if prompt_template else "no",
)
# ── Score the template ───────────────────────────────────────
criteria_scores: dict[str, float] = {}
judge_reasoning = ""
if prompt_template and fixture.expected_template_criteria:
criteria_scores, judge_reasoning = await _judge_template(
prompt_template,
fixture.expected_template_criteria,
judge_model=judge_model,
)
elif not prompt_template:
criteria_scores = {c: 0.0 for c in fixture.expected_template_criteria}
judge_reasoning = "No prompt_template was generated — journey did not complete."
overall = (
sum(criteria_scores.values()) / len(criteria_scores)
if criteria_scores
else 0.0
)
result = JourneyEvalResult(
fixture_name=fixture.name,
model=model,
prompt_template=prompt_template,
conversation_turns=turns,
done=done,
criteria_scores=criteria_scores,
overall_score=overall,
judge_reasoning=judge_reasoning,
elapsed_seconds=elapsed,
)
# ── Report to Langfuse ───────────────────────────────────────
trace_id = langfuse_eval.log_eval_trace(
fixture_name=fixture.name,
model=model,
prompt_variant="journey",
prompt_template=prompt_template or "(not generated)",
actual_mutations=[{"conversation": conversation[:20]}],
scores_summary=result.summary(),
langfuse_prompt_names=["journey_system"],
)
if trace_id:
from eval.scorer import EvalScores
scores_obj = EvalScores(
fixture_name=fixture.name,
model=model,
prompt_variant="journey",
precision=overall,
recall=float(done),
f1=overall,
llm_judge_score=overall,
llm_judge_reasoning=judge_reasoning,
)
langfuse_eval.post_eval_scores(scores_obj, trace_id=trace_id)
return result
async def run_journey_fixture_eval(
fixture: JourneyFixture,
models: list[str],
*,
judge_model: str = "gpt-4o-mini",
data_dir: Path | None = None,
) -> list[JourneyEvalResult]:
"""Run all models for a journey fixture."""
langfuse_eval.sync_journey_fixture_to_dataset(fixture)
results: list[JourneyEvalResult] = []
for model in models:
result = await run_single_journey_eval(
fixture, model, judge_model=judge_model,
data_dir=data_dir,
)
results.append(result)
return results
def print_journey_results(results: list[JourneyEvalResult]) -> None:
"""Print a formatted summary of journey eval results."""
if not results:
print("\nNo journey eval results.")
return
print("\n" + "=" * 95)
print(f"{'Fixture':<25} {'Model':<25} {'Done':>5} {'Turns':>6} {'Score':>7} {'Time':>7}")
print("-" * 95)
for r in results:
done_str = "yes" if r.done else "NO"
print(
f"{r.fixture_name:<25} {r.model:<25} {done_str:>5} "
f"{r.conversation_turns:>6} {r.overall_score:>7.2f} {r.elapsed_seconds:>6.1f}s"
)
print("=" * 95)
# Criteria breakdown
for r in results:
if r.criteria_scores:
print(f"\n[{r.model}] Criteria scores:")
for criterion, score in r.criteria_scores.items():
indicator = "PASS" if score >= 0.7 else "PARTIAL" if score >= 0.4 else "FAIL"
print(f" {indicator:>7} ({score:.1f}) {criterion}")
if r.judge_reasoning:
print(f" Judge: {r.judge_reasoning}")
if r.prompt_template:
preview = r.prompt_template[:200].replace("\n", " ")
print(f" Template preview: {preview}...")
print()
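
The judge-reply handling in `_judge_template` can be tried in isolation. The following is a minimal sketch under stated assumptions, not the harness code: `parse_judge_reply` is a hypothetical helper name, and the reply string is made-up sample data. It strips an optional Markdown fence, parses the JSON, and maps scores back to the criteria, using `None` as the "no key matched" marker so that a genuine score of 0.0 is not replaced by the positional fallback.

```python
import json


def parse_judge_reply(raw: str, criteria: list[str]) -> dict[str, float]:
    """Map a judge's JSON reply onto the original criteria (hypothetical helper)."""
    text = raw.strip()
    # Drop a surrounding Markdown fence, optionally tagged "json".
    if text.startswith("```"):
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    scores_raw = json.loads(text.strip()).get("scores", {})
    positional = list(scores_raw.values())

    result: dict[str, float] = {}
    for i, criterion in enumerate(criteria):
        score = None  # None = no key matched yet
        for key in (f"criterion_{i + 1}", criterion, str(i + 1)):
            if key in scores_raw:
                score = float(scores_raw[key])
                break
        if score is None and i < len(positional):
            score = float(positional[i])  # fall back to the reply order
        result[criterion] = 0.0 if score is None else score
    return result


# Made-up sample reply, wrapped in a fence the way chat models often do.
reply = '```json\n{"scores": {"criterion_1": 0.0, "criterion_2": 1.0}, "reasoning": "ok"}\n```'
scores = parse_judge_reply(reply, ["mentions invoices", "names output table"])
overall = sum(scores.values()) / len(scores)
```

With a plain `0.0` default instead of `None`, the first criterion here would fall through to the positional lookup even though the judge explicitly scored it 0.0; the result happens to be the same in this sample, but it would differ whenever the reply order does not match the criteria order.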


@@ -0,0 +1,327 @@
"""Langfuse evaluation integration: datasets, runs, and scoring.
Uses the Langfuse Python SDK v4 (OpenTelemetry-based) to:
1. **Sync fixtures → Langfuse datasets**: Each YAML fixture becomes a dataset
   with one dataset item for its eval mode (step1, step2, or full).
2. **Track eval runs**: Each (fixture × model × mode) execution is recorded
   as a trace with linked scores; the mode is passed in ``prompt_variant``.
3. **Post scores**: precision, recall, and F1 are always posted as numeric
   scores on the trace; field_accuracy and llm_judge only when available.
"""
from __future__ import annotations
import logging
import os
from typing import Any
from shared.config import settings
from eval.config import EvalFixture
from eval.scorer import EvalScores
logger = logging.getLogger(__name__)
def _get_langfuse():
"""Get or create a Langfuse client instance (SDK v4)."""
if not settings.LANGFUSE_SECRET_KEY or not settings.LANGFUSE_PUBLIC_KEY:
return None
try:
os.environ.setdefault("LANGFUSE_SECRET_KEY", settings.LANGFUSE_SECRET_KEY)
os.environ.setdefault("LANGFUSE_PUBLIC_KEY", settings.LANGFUSE_PUBLIC_KEY)
if settings.LANGFUSE_HOST:
os.environ.setdefault("LANGFUSE_HOST", settings.LANGFUSE_HOST)
from langfuse import get_client
return get_client()
except Exception as exc:
logger.warning("langfuse_eval: failed to create client: %s", exc)
return None
def sync_fixture_to_dataset(fixture: EvalFixture) -> str | None:
"""Create or update a Langfuse dataset from a fixture.
The fixture becomes a single dataset item, keyed by its eval mode, with:
- input: {directory, data_types, mode, seed_records}
- expected_output: classifications (step1/full) and/or expected records
  grouped by table (step2/full)
Returns the dataset name, or None if Langfuse is unavailable.
"""
lf = _get_langfuse()
if lf is None:
logger.info("langfuse_eval: Langfuse not configured — skipping dataset sync")
return None
dataset_name = f"batch-eval-{fixture.name}"
try:
lf.create_dataset(
name=dataset_name,
description=fixture.description,
metadata={
"data_types": ",".join(fixture.data_types),
"file_extensions": ",".join(fixture.file_extensions) if fixture.file_extensions else "",
},
)
except Exception:
# Dataset may already exist — that's fine
pass
# Build expected_output appropriate to the fixture's mode
expected_output: dict[str, Any] = {}
if fixture.mode in ("step1", "full") and fixture.expected_classification:
expected_output["classifications"] = [
{"file": ec.file, "project_id": ec.project_id, "domains": ec.domains}
for ec in fixture.expected_classification
]
if fixture.mode in ("step2", "full") and fixture.expected:
for rec in fixture.expected:
expected_output.setdefault(rec.table, []).append(rec.fields)
item_id = f"{fixture.name}--{fixture.mode}"
try:
lf.create_dataset_item(
dataset_name=dataset_name,
id=item_id,
input={
"directory": fixture.directory,
"data_types": fixture.data_types,
"mode": fixture.mode,
"seed_records": fixture.seed_records,
},
expected_output=expected_output,
metadata={"mode": fixture.mode},
)
except Exception as exc:
logger.warning(
"langfuse_eval: failed to upsert dataset item %s: %s", item_id, exc
)
lf.flush()
logger.info("langfuse_eval: synced fixture '%s' → dataset '%s'", fixture.name, dataset_name)
return dataset_name
def sync_journey_fixture_to_dataset(fixture) -> str | None:
"""Create or update a Langfuse dataset from a journey fixture.
Each journey fixture becomes a single dataset item with:
- input: {directory, data_types, user_messages}
- expected_output: {criteria}
"""
lf = _get_langfuse()
if lf is None:
logger.info("langfuse_eval: Langfuse not configured — skipping journey dataset sync")
return None
dataset_name = f"journey-eval-{fixture.name}"
try:
lf.create_dataset(
name=dataset_name,
description=fixture.description,
metadata={"type": "journey", "data_types": ",".join(fixture.data_types)},
)
except Exception:
pass # Dataset may already exist
item_id = f"{fixture.name}--journey"
try:
lf.create_dataset_item(
dataset_name=dataset_name,
id=item_id,
input={
"directory": fixture.directory,
"data_types": fixture.data_types,
"user_messages": fixture.user_messages,
},
expected_output={
"criteria": fixture.expected_template_criteria,
},
metadata={"type": "journey"},
)
except Exception as exc:
logger.warning("langfuse_eval: failed to upsert journey dataset item %s: %s", item_id, exc)
lf.flush()
logger.info("langfuse_eval: synced journey fixture '%s' → dataset '%s'", fixture.name, dataset_name)
return dataset_name
def create_eval_run(
dataset_name: str,
run_name: str,
*,
metadata: dict[str, Any] | None = None,
) -> str:
"""Create a dataset run in Langfuse. Returns the run name.
Note: In SDK v4, dataset runs are created implicitly via
dataset.run_experiment(). This function is kept for backwards
compatibility but may not create a run.
"""
lf = _get_langfuse()
if lf is None:
return run_name
try:
if hasattr(lf, "create_dataset_run"):
lf.create_dataset_run(
dataset_name=dataset_name,
run_name=run_name,
metadata=metadata or {},
)
lf.flush()
else:
logger.debug("langfuse_eval: create_dataset_run not available in SDK v4")
except Exception as exc:
logger.warning("langfuse_eval: failed to create run %s: %s", run_name, exc)
return run_name
def post_eval_scores(
scores: EvalScores,
*,
trace_id: str | None = None,
dataset_name: str | None = None,
run_name: str | None = None,
) -> None:
"""Post evaluation scores to Langfuse.
If trace_id is provided, scores are attached to that trace.
"""
lf = _get_langfuse()
if lf is None:
return
score_data = [
("precision", scores.precision),
("recall", scores.recall),
("f1", scores.f1),
]
# Only post field_accuracy when there are field-level scores (step2/full)
if scores.field_scores:
score_data.append(("field_accuracy", scores.field_accuracy))
if scores.llm_judge_score is not None:
score_data.append(("llm_judge", scores.llm_judge_score))
for name, value in score_data:
try:
lf.create_score(
name=name,
value=value,
trace_id=trace_id,
data_type="NUMERIC",
comment=f"{scores.fixture_name} | {scores.model} | {scores.prompt_variant}",
)
except Exception as exc:
logger.warning("langfuse_eval: failed to post score %s: %s", name, exc)
lf.flush()
logger.info(
"langfuse_eval: posted %d scores for %s/%s/%s",
len(score_data), scores.fixture_name, scores.model, scores.prompt_variant,
)
def log_eval_trace(
*,
fixture_name: str,
model: str,
prompt_variant: str,
prompt_template: str,
actual_mutations: list[dict],
scores_summary: dict[str, Any],
step1_results: list[dict] | None = None,
dataset_name: str | None = None,
run_name: str | None = None,
dataset_item_id: str | None = None,
langfuse_prompt_names: list[str] | None = None,
) -> str | None:
"""Create a Langfuse trace for one eval execution and link it to a dataset run.
Uses SDK v4 observation API (traces are created implicitly by root spans).
``langfuse_prompt_names`` can contain one or two prompt names to link
(e.g. ``["batch_file_classifier", "batch_processing"]`` for full mode).
Each prompt gets its own generation-type observation for per-version
metrics tracking.
Returns the trace_id, or None if Langfuse is unavailable.
"""
lf = _get_langfuse()
if lf is None:
return None
try:
from langfuse import propagate_attributes
# Fetch prompt objects for linking
prompt_objs: list[tuple[str, Any]] = []
for pname in (langfuse_prompt_names or []):
try:
obj = lf.get_prompt(name=pname, cache_ttl_seconds=300)
prompt_objs.append((pname, obj))
logger.info("langfuse_eval: linked prompt '%s' (type=%s)", pname, type(obj).__name__)
except Exception as exc:
logger.warning("langfuse_eval: prompt '%s' not found — %s", pname, exc)
# Build trace output dict
trace_output: dict[str, Any] = {"scores": scores_summary}
if step1_results:
trace_output["classifications"] = step1_results
if actual_mutations:
trace_output["mutations"] = actual_mutations[:50]
with propagate_attributes(
trace_name=f"eval-{fixture_name}",
metadata={
"eval": "true",
"fixture": fixture_name,
"model": model,
"prompt_variant": prompt_variant,
},
tags=["eval", f"model:{model}", f"variant:{prompt_variant}"],
):
# Root span for the eval run
span = lf.start_observation(name=f"eval-{fixture_name}")
span.update(
input={
"prompt_template": prompt_template,
"model": model,
"prompt_variant": prompt_variant,
},
output=trace_output,
)
trace_id = span.trace_id
# Create a generation-type observation per linked prompt
for pname, pobj in prompt_objs:
gen = lf.start_observation(
name=f"prompt-{pname}",
prompt=pobj,
as_type="generation",
)
gen.end()
# Link to dataset run if available
if dataset_name and run_name and dataset_item_id:
try:
dataset = lf.get_dataset(dataset_name)
for item in dataset.items:
if item.id == dataset_item_id:
item.link(span, run_name)
break
except Exception as exc:
logger.warning("langfuse_eval: failed to link trace to dataset run: %s", exc)
span.end()
lf.flush()
return trace_id
except Exception as exc:
logger.warning("langfuse_eval: failed to create eval trace: %s", exc)
return None
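
The mode-aware `expected_output` built in `sync_fixture_to_dataset` is easy to check without a Langfuse client. The sketch below is illustrative only: `Classification`, `Record`, and `build_expected_output` are hypothetical stand-ins for the real fixture types in `eval.config`, with only the fields the dataset item uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Classification:  # hypothetical stand-in for ExpectedClassification
    file: str
    project_id: str | None
    domains: list[str]


@dataclass
class Record:  # hypothetical stand-in for an expected record
    table: str
    fields: dict[str, Any]


def build_expected_output(
    mode: str,
    classifications: list[Classification],
    records: list[Record],
) -> dict[str, Any]:
    """Assemble the dataset item's expected_output for one eval mode."""
    out: dict[str, Any] = {}
    # step1 and full carry the expected classifications.
    if mode in ("step1", "full") and classifications:
        out["classifications"] = [
            {"file": c.file, "project_id": c.project_id, "domains": c.domains}
            for c in classifications
        ]
    # step2 and full carry the expected records, grouped by table name.
    if mode in ("step2", "full") and records:
        for rec in records:
            out.setdefault(rec.table, []).append(rec.fields)
    return out


cls = [Classification("a.pdf", "p1", ["invoices"])]
recs = [Record("invoices", {"number": "INV-1"}), Record("invoices", {"number": "INV-2"})]

step1_out = build_expected_output("step1", cls, recs)
step2_out = build_expected_output("step2", cls, recs)
full_out = build_expected_output("full", cls, recs)
```

One consequence of this layout is that table names share a namespace with the `classifications` key, so a table literally named `classifications` would be merged into the same list in full mode.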


@@ -0,0 +1,258 @@
"""Mock executor — intercepts execute_on_client for offline E2E testing.
Patches ``execute_on_client`` at all usage sites so agent pipeline runs don't
require a live Electron client or Redis. Instead:
- **Filesystem actions** (list_directory, read_file_content, get_file_metadata)
are served from local fixture files on disk.
- **Read actions** (select, get) return preseeded records from an in-memory
store provided by the test fixture.
- **Write actions** (insert, update, delete) are captured as *mutations* and
stored for later comparison against expected results.
"""
from __future__ import annotations
import json
import os
import time
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from contextlib import contextmanager, asynccontextmanager
from unittest.mock import AsyncMock, patch
@dataclass
class Mutation:
"""A single recorded write operation."""
action: str # insert | update | delete
table: str
data: dict[str, Any]
timestamp: float = field(default_factory=time.time)
# ── Fake DB helpers (used to bypass async_session in full mode) ───────
class _FakeRow:
"""Mimics an AgentRunLog row returned by SQLAlchemy."""
id = 0
status = "running"
items_processed = 0
items_created = 0
errors: list[str] = []
completed_at = None
def __setattr__(self, name: str, value: Any) -> None:
object.__setattr__(self, name, value)
class _FakeResult:
"""Mimics a SQLAlchemy ``Result`` with ``scalar_one_or_none``."""
def __init__(self, row: _FakeRow) -> None:
self._row = row
def scalar_one_or_none(self) -> _FakeRow:
return self._row
@dataclass
class MockExecutor:
"""In-memory executor that replaces Redis-based tool round-trip.
Parameters
----------
fixture_dir : Path
Directory containing sample files for filesystem tool calls.
seed_records : dict[str, list[dict]]
Pre-existing records per table, e.g. ``{"tasks": [...], "projects": [...]}``.
The executor returns these for ``select`` / ``get`` actions and auto-updates
them on ``insert`` / ``update`` / ``delete`` so subsequent selects reflect changes.
"""
fixture_dir: Path
seed_records: dict[str, list[dict]] = field(default_factory=dict)
mutations: list[Mutation] = field(default_factory=list)
_id_counter: int = field(default=1000, repr=False)
# ── Public API ───────────────────────────────────────────────────
def reset(self) -> None:
"""Clear recorded mutations; seed_records keeps any rows changed by earlier writes."""
self.mutations.clear()
def get_mutations(self, *, table: str | None = None, action: str | None = None) -> list[Mutation]:
"""Filter mutations by table and/or action."""
result = self.mutations
if table:
result = [m for m in result if m.table == table]
if action:
result = [m for m in result if m.action == action]
return result
def created_records(self, table: str) -> list[dict]:
"""Return data dicts of all inserts into *table*."""
return [m.data for m in self.mutations if m.table == table and m.action == "insert"]
def updated_records(self, table: str) -> list[dict]:
"""Return data dicts of all updates to *table*."""
return [m.data for m in self.mutations if m.table == table and m.action == "update"]
# ── Context manager for patching ──────────────────────────────
    @contextmanager
    def patch(self):
        """Patch execute_on_client and DB session at all usage sites."""
        mock_fn = AsyncMock(side_effect=self._handle)
        targets = [
            "shared.ws_context.execute_on_client",
            "app.agent_runner.execute_on_client",
            "app.agents.filesystem_agent.execute_on_client",
        ]
        # Mock async_session so run_local_agent / _finalize_run skip real DB
        fake_row = _FakeRow()
        fake_db = AsyncMock()
        fake_db.commit = AsyncMock()
        fake_db.refresh = AsyncMock()
        fake_db.execute = AsyncMock(return_value=_FakeResult(fake_row))
        fake_db.add = lambda obj: None  # noqa: ARG005

        @asynccontextmanager
        async def _fake_session():
            yield fake_db

        patches = [patch(t, new=mock_fn) for t in targets]
        patches.append(patch("app.agent_runner.async_session", _fake_session))
        started = []
        try:
            # Start inside the try block so that a failing start() cannot
            # leak the patches that were already applied.
            for p in patches:
                p.start()
                started.append(p)
            yield mock_fn
        finally:
            for p in reversed(started):
                p.stop()
# ── Internal dispatch ─────────────────────────────────────────
async def _handle(
self,
action: str,
table: str | None = None,
data: dict[str, Any] | None = None,
filters: dict[str, Any] | None = None,
vector: list[float] | None = None,
limit: int | None = None,
) -> dict[str, Any]:
# Filesystem
if action == "list_directory":
return self._list_directory(data or {})
if action == "read_file_content":
return self._read_file(data or {})
if action == "get_file_metadata":
return self._get_file_metadata(data or {})
# CRUD
if action == "select":
return self._select(table or "", filters)
if action == "get":
return self._get(table or "", data or {})
if action == "insert":
return self._insert(table or "", data or {})
if action == "update":
return self._update(table or "", data or {})
if action == "delete":
return self._delete(table or "", data or {})
# Vector (no-op for eval)
if action in ("vector_upsert", "vector_search"):
return {"rows": []}
return {"error": f"Unknown action: {action}"}
# ── Filesystem handlers ───────────────────────────────────────
def _list_directory(self, data: dict) -> dict:
rel_path = data.get("path", "")
abs_path = self.fixture_dir / rel_path.lstrip("/\\")
if not abs_path.is_dir():
return {"entries": []}
entries: list[dict] = []
for child in sorted(abs_path.iterdir()):
entry_type = "directory" if child.is_dir() else "file"
# Return paths relative to fixture_dir but with the original prefix
entry_path = rel_path.rstrip("/\\") + "/" + child.name
entries.append({
"name": child.name,
"path": entry_path,
"type": entry_type,
})
return {"entries": entries}
def _read_file(self, data: dict) -> dict:
rel_path = data.get("path", "")
abs_path = self.fixture_dir / rel_path.lstrip("/\\")
if not abs_path.is_file():
return {"content": "", "error": f"File not found: {rel_path}"}
return {"content": abs_path.read_text(encoding="utf-8", errors="replace")}
def _get_file_metadata(self, data: dict) -> dict:
rel_path = data.get("path", "")
abs_path = self.fixture_dir / rel_path.lstrip("/\\")
if not abs_path.exists():
return {"error": f"Not found: {rel_path}"}
stat = abs_path.stat()
return {
"path": rel_path,
"size": stat.st_size,
"modifiedAt": int(stat.st_mtime * 1000),
"createdAt": int(stat.st_ctime * 1000),
"isDirectory": abs_path.is_dir(),
}
# ── CRUD handlers ─────────────────────────────────────────────
def _select(self, table: str, filters: dict | None) -> dict:
rows = list(self.seed_records.get(table, []))
if filters:
rows = [
r for r in rows
if all(r.get(k) == v for k, v in filters.items() if v is not None)
]
return {"rows": rows}
def _get(self, table: str, data: dict) -> dict:
record_id = data.get("id", "")
rows = self.seed_records.get(table, [])
for r in rows:
if r.get("id") == record_id:
return {"row": r}
return {"row": None}
def _insert(self, table: str, data: dict) -> dict:
self._id_counter += 1
record = {**data, "id": str(self._id_counter)}
# Add to seed so subsequent selects can find it
self.seed_records.setdefault(table, []).append(record)
self.mutations.append(Mutation(action="insert", table=table, data=record))
return {"row": record}
def _update(self, table: str, data: dict) -> dict:
record_id = data.get("id", "")
rows = self.seed_records.get(table, [])
for r in rows:
if r.get("id") == record_id:
r.update({k: v for k, v in data.items() if v is not None and v != ""})
self.mutations.append(Mutation(action="update", table=table, data=dict(r)))
return {"row": r}
# Record not found — still log the mutation
self.mutations.append(Mutation(action="update", table=table, data=data))
return {"row": data}
def _delete(self, table: str, data: dict) -> dict:
record_id = data.get("id", "")
rows = self.seed_records.get(table, [])
self.seed_records[table] = [r for r in rows if r.get("id") != record_id]
self.mutations.append(Mutation(action="delete", table=table, data={"id": record_id}))
return {"deleted": True}
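
The core mechanism of the mock executor, an `AsyncMock` whose `side_effect` is an async dispatcher over an in-memory store, can be sketched on its own. This is a simplified illustration, not the `MockExecutor` class: the `handle` function, the `records` and `mutations` globals, and the sample invoice data are all assumptions made for the example.

```python
import asyncio
from unittest.mock import AsyncMock

records: dict[str, list[dict]] = {"invoices": []}  # in-memory table store
mutations: list[tuple[str, str, dict]] = []        # recorded writes


async def handle(action, table=None, data=None, filters=None):
    """Serve reads from the store and record writes (simplified dispatcher)."""
    if action == "insert":
        row = {**(data or {}), "id": str(len(mutations) + 1001)}
        # Append to the store so later selects see the new row.
        records.setdefault(table, []).append(row)
        mutations.append(("insert", table, row))
        return {"row": row}
    if action == "select":
        rows = records.get(table, [])
        if filters:
            rows = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
        return {"rows": list(rows)}
    return {"error": f"Unknown action: {action}"}


# AsyncMock awaits an async side_effect and returns its result,
# so callers can keep writing `await execute_on_client(...)`.
execute_on_client = AsyncMock(side_effect=handle)


async def demo() -> dict:
    await execute_on_client("insert", table="invoices", data={"number": "INV-1", "total": 120.0})
    return await execute_on_client("select", table="invoices", filters={"number": "INV-1"})


result = asyncio.run(demo())
```

Because the mock records every await, the same object serves both as the fake transport and as a call log (`execute_on_client.await_count`), while `mutations` holds the writes to compare against the expected records.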


@@ -0,0 +1,2 @@
# Extra dependencies for the eval harness (on top of the service requirements.txt)
pyyaml>=6.0.0


@@ -0,0 +1,463 @@
"""Eval runner — orchestrates fixture → mock → agent pipeline → scoring.
Supports three eval modes:
- **step1**: Test classification prompt only (``_STEP1_SYSTEM_PROMPT``).
Calls the LLM with fixture-provided ``domain_definitions`` and
``projects_list`` and compares output against ``expected_classification``.
- **step2**: Test processing prompt only (``_PROCESSING_SYSTEM_PROMPT``).
Compiles the prompt with fixture-provided ``existing_context``,
``project_context``, ``data_types``, and ``custom_prompt_section``,
then runs the tool-calling loop. Mutations are scored against
``expected`` records.
- **full**: Run ``run_local_agent()`` end-to-end (both steps).
Scored on both classification and extraction.
"""
from __future__ import annotations
import copy
import json
import logging
import time
import uuid
from typing import Any
from eval.config import EvalFixture, ExpectedClassification
from eval.mock_executor import MockExecutor
from eval.scorer import (
EvalScores,
FieldScore,
compute_precision_recall,
llm_judge_score,
score_field_match,
)
from eval import langfuse_eval
logger = logging.getLogger(__name__)
# ── Step 1 runner ─────────────────────────────────────────────────────────
async def _run_step1(
fixture: EvalFixture,
model: str,
mock: MockExecutor,
) -> list[dict[str, Any]]:
"""Run step-1 classification for each expected file.
Returns a list of result dicts:
``[{file, project_id, domains, new_project_name}, ...]``
"""
from app.agent_runner import _classify_file
results: list[dict[str, Any]] = []
for ec in fixture.expected_classification:
# Read the file content through the mock
file_result = await mock._handle(
action="read_file_content",
data={"path": ec.file},
)
file_content: str = file_result.get("content", "")
project_id, domains, new_name = await _classify_file(
file_path=ec.file,
file_content=file_content,
projects=fixture.projects_list,
config_data_types=fixture.data_types,
)
results.append({
"file": ec.file,
"project_id": project_id,
"domains": domains,
"new_project_name": new_name,
})
return results
def _score_step1(
    fixture: EvalFixture,
    results: list[dict[str, Any]],
) -> tuple[float, float, float, str]:
    """Score step-1 results. Returns (precision, recall, f1, reasoning)."""
    if not fixture.expected_classification:
        return 0.0, 0.0, 0.0, "No expected classifications"

    total = len(fixture.expected_classification)
    matched = 0
    details: list[str] = []
    for ec in fixture.expected_classification:
        actual = next((r for r in results if r["file"] == ec.file), None)
        if actual is None:
            details.append(f" MISS {ec.file}: not processed")
            continue
        pid_ok = actual["project_id"] == ec.project_id
        # An empty expected domain list means the domains are not checked
        domains_ok = set(actual["domains"]) == set(ec.domains) if ec.domains else True
        if pid_ok and domains_ok:
            matched += 1
            details.append(f" OK {ec.file}: project={actual['project_id']}, domains={actual['domains']}")
        else:
            parts: list[str] = []
            if not pid_ok:
                parts.append(f"project expected={ec.project_id} got={actual['project_id']}")
            if not domains_ok:
                parts.append(f"domains expected={ec.domains} got={actual['domains']}")
            details.append(f" FAIL {ec.file}: {'; '.join(parts)}")

    # Step 1 is scored as matched / total over the expected files, so the
    # same value is reported as precision, recall, and F1.
    precision = matched / total if total > 0 else 0.0
    recall = precision
    f1 = precision
    reasoning = "\n".join(details)
    return precision, recall, f1, reasoning
# ── Step 2 runner ─────────────────────────────────────────────────────────
async def _run_step2(
fixture: EvalFixture,
model: str,
mock: MockExecutor,
) -> None:
"""Run step-2 processing for each file in the fixture directory.
Compiles ``_PROCESSING_SYSTEM_PROMPT`` with fixture-provided variables
and runs the tool-calling loop. Mutations are captured by the mock.
"""
from app.agent_runner import (
_PROCESSING_SYSTEM_PROMPT,
_build_processing_tools,
_run_agent_with_tools,
_MAX_PROCESSING_STEPS,
)
from app import tracing
# Compile the processing prompt with fixture variables
system_prompt = tracing.compile_prompt(
"batch_processing",
fallback=_PROCESSING_SYSTEM_PROMPT,
variables={
"existing_context": fixture.existing_context,
"project_context": fixture.project_context,
"data_types": ", ".join(fixture.data_types),
"custom_prompt_section": fixture.custom_prompt_section,
},
)
tools = _build_processing_tools(fixture.data_types)
# Scan files in the fixture directory
file_entries = await mock._handle(
action="list_directory",
data={"path": fixture.directory},
)
for entry in file_entries.get("entries", []):
if entry.get("type") != "file":
continue
# Filter by extension if specified
if fixture.file_extensions:
ext = entry["name"].rsplit(".", 1)[-1] if "." in entry["name"] else ""
if ext not in fixture.file_extensions:
continue
file_result = await mock._handle(
action="read_file_content",
data={"path": entry["path"]},
)
file_content: str = file_result.get("content", "")
if not file_content.strip():
continue
await _run_agent_with_tools(
system_prompt=system_prompt,
user_message=(
f"Process this file and extract relevant information.\n\n"
f"File: {entry['path']}\n\nContent:\n{file_content}"
),
tools=tools,
max_steps=_MAX_PROCESSING_STEPS,
)
# ── Full runner ───────────────────────────────────────────────────────────
async def _run_full(
fixture: EvalFixture,
model: str,
mock: MockExecutor,
user_id: str,
) -> None:
"""Run the full two-step pipeline via ``run_local_agent``."""
from app.agent_runner import run_local_agent
trigger_data: dict[str, Any] = {
"type": "agent_trigger",
"directory": fixture.directory,
"directory_paths": [fixture.directory],
"data_types": fixture.data_types,
"file_extensions": fixture.file_extensions,
"prompt_template": fixture.custom_prompt_section,
"device_id": "eval-harness",
"run_context": {
"agent_id": f"eval-{fixture.name}",
"run_id": None,
},
}
with mock.patch():
await run_local_agent(user_id, trigger_data)
# ── Scoring helpers ───────────────────────────────────────────────────────
def _score_mutations(
fixture: EvalFixture,
mock: MockExecutor,
) -> tuple[list[FieldScore], float, float, float, int, int]:
"""Score mutations against expected records.
Returns (field_scores, precision, recall, f1, extra, missing).
"""
all_field_scores: list[FieldScore] = []
total_expected = 0
total_actual = 0
total_matched = 0
total_extra = 0
total_missing = 0
expected_by_table: dict[str, list[dict]] = {}
for rec in fixture.expected:
expected_by_table.setdefault(rec.table, []).append(rec.fields)
tables = set(expected_by_table.keys()) | {m.table for m in mock.mutations}
for table in tables:
expected_records = expected_by_table.get(table, [])
actual_records = mock.created_records(table) + mock.updated_records(table)
field_scores, extra, missing = score_field_match(expected_records, actual_records, table)
all_field_scores.extend(field_scores)
matched = sum(1 for s in field_scores if s.best_match is not None)
total_expected += len(expected_records)
total_actual += len(actual_records)
total_matched += matched
total_extra += extra
total_missing += missing
precision, recall, f1 = compute_precision_recall(total_expected, total_actual, total_matched)
return all_field_scores, precision, recall, f1, total_extra, total_missing
# ── Main entry point ──────────────────────────────────────────────────────
async def run_single_eval(
fixture: EvalFixture,
model: str,
*,
use_llm_judge: bool = True,
judge_model: str = "gpt-4o-mini",
) -> EvalScores:
"""Execute one eval run for a fixture + model. Mode is read from the fixture."""
from shared.config import settings
from shared.ws_context import set_current_user, clear_current_user
seed = copy.deepcopy(fixture.seed_records)
mock = MockExecutor(
fixture_dir=fixture.fixture_path.parent,
seed_records=seed,
)
original_model = settings.LLM_MODEL
settings.LLM_MODEL = model
eval_user_id = str(uuid.uuid4())
logger.info(
"eval: starting %s | mode=%s | model=%s",
fixture.name, fixture.mode, model,
)
start_time = time.time()
step1_results: list[dict[str, Any]] = []
step1_reasoning = ""
try:
set_current_user(eval_user_id)
if fixture.mode == "step1":
with mock.patch():
step1_results = await _run_step1(fixture, model, mock)
elif fixture.mode == "step2":
with mock.patch():
await _run_step2(fixture, model, mock)
elif fixture.mode == "full":
with mock.patch():
# Step 1 — classification (independent from run_local_agent)
if fixture.expected_classification:
step1_results = await _run_step1(fixture, model, mock)
# Step 2 — full pipeline (run_local_agent handles both steps)
await _run_full(fixture, model, mock, eval_user_id)
except Exception:
# logger.exception records the traceback, which logger.error with "%s" drops
logger.exception("eval: pipeline failed for %s/%s", fixture.name, model)
finally:
settings.LLM_MODEL = original_model
clear_current_user()
elapsed = time.time() - start_time
logger.info("eval: completed in %.1fs — %d mutations", elapsed, len(mock.mutations))
# ── Score ─────────────────────────────────────────────────────
if fixture.mode == "step1":
s1_precision, s1_recall, s1_f1, step1_reasoning = _score_step1(fixture, step1_results)
scores = EvalScores(
fixture_name=fixture.name,
model=model,
prompt_variant=fixture.mode,
precision=s1_precision,
recall=s1_recall,
f1=s1_f1,
llm_judge_reasoning=step1_reasoning,
)
else:
# step2 or full — score mutations
field_scores, precision, recall, f1, extra, missing = _score_mutations(fixture, mock)
scores = EvalScores(
fixture_name=fixture.name,
model=model,
prompt_variant=fixture.mode,
field_scores=field_scores,
precision=precision,
recall=recall,
f1=f1,
extra_records=extra,
missing_records=missing,
)
# Add step1 classification scores for full mode
if fixture.mode == "full" and fixture.expected_classification:
s1_p, s1_r, s1_f1, step1_reasoning = _score_step1(fixture, step1_results)
scores.llm_judge_reasoning = f"Step1 classification:\n{step1_reasoning}"
# Optional LLM judge for extraction quality
if use_llm_judge and fixture.expected:
all_expected = [r.fields for r in fixture.expected]
all_actual = [m.data for m in mock.mutations if m.action in ("insert", "update")]
judge_score, reasoning = await llm_judge_score(
all_expected, all_actual, judge_model=judge_model,
)
scores.llm_judge_score = judge_score
if step1_reasoning:
scores.llm_judge_reasoning += f"\n\nLLM judge:\n{reasoning}"
else:
scores.llm_judge_reasoning = reasoning
# ── Report to Langfuse ────────────────────────────────────────
prompt_names = {
"step1": ["batch_file_classifier"],
"step2": ["batch_processing"],
"full": ["batch_file_classifier", "batch_processing"],
}.get(fixture.mode, ["batch_processing"])
trace_id = langfuse_eval.log_eval_trace(
fixture_name=fixture.name,
model=model,
prompt_variant=fixture.mode,
prompt_template=fixture.custom_prompt_section or "(default)",
actual_mutations=[{"action": m.action, "table": m.table, "data": m.data} for m in mock.mutations],
scores_summary=scores.summary(),
step1_results=step1_results or None,
langfuse_prompt_names=prompt_names,
)
if trace_id:
langfuse_eval.post_eval_scores(scores, trace_id=trace_id)
# For full mode, post classification scores separately
if fixture.mode == "full" and fixture.expected_classification:
s1_p, s1_r, s1_f1, _ = _score_step1(fixture, step1_results)
for name, value in [
("classification_precision", s1_p),
("classification_recall", s1_r),
("classification_f1", s1_f1),
]:
try:
from langfuse import get_client
lf = get_client()
if lf:
lf.create_score(
name=name,
value=value,
trace_id=trace_id,
data_type="NUMERIC",
comment=f"{fixture.name} | {model} | full",
)
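except Exception as exc:
# Score posting is best-effort, but leave a trace instead of failing silently
logger.debug("eval: could not post %s to Langfuse: %s", name, exc)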
return scores
async def run_fixture_eval(
fixture: EvalFixture,
models: list[str],
*,
use_llm_judge: bool = True,
judge_model: str = "gpt-4o-mini",
) -> list[EvalScores]:
"""Run all models for a fixture."""
langfuse_eval.sync_fixture_to_dataset(fixture)
results: list[EvalScores] = []
for model in models:
scores = await run_single_eval(
fixture, model,
use_llm_judge=use_llm_judge,
judge_model=judge_model,
)
results.append(scores)
return results
def print_results(results: list[EvalScores]) -> None:
"""Print a formatted summary table of eval results."""
if not results:
print("\nNo eval results.")
return
print("\n" + "=" * 95)
print(f"{'Fixture':<25} {'Mode':<6} {'Model':<25} {'P':>6} {'R':>6} {'F1':>6} {'FA':>6} {'LLM':>6}")
print("-" * 95)
for s in results:
llm_str = f"{s.llm_judge_score:.2f}" if s.llm_judge_score is not None else " --"
print(
f"{s.fixture_name:<25} {s.prompt_variant:<6} {s.model:<25} "
f"{s.precision:>6.2f} {s.recall:>6.2f} {s.f1:>6.2f} "
f"{s.field_accuracy:>6.2f} {llm_str:>6}"
)
print("=" * 95)
print()
# If LLM judge reasoning is available, print it
for s in results:
if s.llm_judge_reasoning:
print(f"\n[{s.model} / {s.prompt_variant}] LLM Judge: {s.llm_judge_reasoning}")
print()
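
Review note: run_single_eval swaps settings.LLM_MODEL for the duration of a run and restores it in a finally block. A minimal sketch of the same pattern as a reusable context manager follows. It assumes a plain attribute-based settings object; `override_attr` and `FakeSettings` are hypothetical names used for illustration and are not part of this repository.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeSettings:
    """Hypothetical stand-in for shared.config.settings."""

    LLM_MODEL = "gpt-4o"


@contextmanager
def override_attr(obj: Any, name: str, value: Any) -> Iterator[None]:
    """Temporarily set obj.<name> to value and restore the old value on exit."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        # Runs even if the body raises, like the finally block in run_single_eval.
        setattr(obj, name, original)


settings = FakeSettings()
with override_attr(settings, "LLM_MODEL", "anthropic/some-model"):
    model_during_run = settings.LLM_MODEL
model_after_run = settings.LLM_MODEL
```

Either way the override is process-wide, so it is only safe while models are evaluated one after another, which is what run_fixture_eval does.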


@@ -0,0 +1,268 @@
"""Scoring functions for batch agent evaluation.
Two scoring strategies:
1. **FieldMatchScorer** — deterministic check: for each expected record,
find the best-matching actual record and compare specified fields.
Returns precision, recall, and per-field accuracy.
2. **LLMJudgeScorer** — uses a secondary LLM to semantically evaluate
whether the actual extractions satisfy the expected intent, even if
wording differs. Returns a 0-1 score + reasoning.
"""
from __future__ import annotations
import json
import logging
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Any
from langchain_core.messages import HumanMessage, SystemMessage
logger = logging.getLogger(__name__)
# ── Result types ─────────────────────────────────────────────────────────
@dataclass
class FieldScore:
"""Score for a single expected record against its best match."""
expected: dict[str, Any]
best_match: dict[str, Any] | None
matched_fields: dict[str, bool]
similarity: float # 0-1 overall similarity
@property
def field_accuracy(self) -> float:
if not self.matched_fields:
return 0.0
return sum(self.matched_fields.values()) / len(self.matched_fields)
@dataclass
class EvalScores:
"""Aggregated scores for one eval run."""
fixture_name: str
model: str
prompt_variant: str
field_scores: list[FieldScore] = field(default_factory=list)
precision: float = 0.0
recall: float = 0.0
f1: float = 0.0
llm_judge_score: float | None = None
llm_judge_reasoning: str = ""
extra_records: int = 0 # records created but not expected
missing_records: int = 0 # expected but not found
@property
def field_accuracy(self) -> float:
if not self.field_scores:
return 0.0
return sum(s.field_accuracy for s in self.field_scores) / len(self.field_scores)
def summary(self) -> dict[str, Any]:
return {
"fixture": self.fixture_name,
"model": self.model,
"prompt_variant": self.prompt_variant,
"precision": round(self.precision, 3),
"recall": round(self.recall, 3),
"f1": round(self.f1, 3),
"field_accuracy": round(self.field_accuracy, 3),
"llm_judge_score": round(self.llm_judge_score, 3) if self.llm_judge_score is not None else None,
"extra_records": self.extra_records,
"missing_records": self.missing_records,
}
# ── Field Match Scorer ───────────────────────────────────────────────────
def _normalize(value: Any) -> str:
"""Normalize a value for comparison."""
if value is None:
return ""
return str(value).strip().lower()
def _text_similarity(a: str, b: str) -> float:
"""Fuzzy text similarity using SequenceMatcher."""
if not a and not b:
return 1.0
if not a or not b:
return 0.0
return SequenceMatcher(None, a.lower(), b.lower()).ratio()
def _find_best_match(
expected: dict[str, Any],
actuals: list[dict[str, Any]],
) -> tuple[dict[str, Any] | None, float]:
"""Find the actual record most similar to expected, return (match, similarity)."""
if not actuals:
return None, 0.0
best_match = None
best_score = 0.0
# Primary matching key: title or name
expected_title = _normalize(expected.get("title", expected.get("name", "")))
for actual in actuals:
actual_title = _normalize(actual.get("title", actual.get("name", "")))
sim = _text_similarity(expected_title, actual_title)
if sim > best_score:
best_score = sim
best_match = actual
return best_match, best_score
def _compare_fields(
expected: dict[str, Any],
actual: dict[str, Any],
) -> dict[str, bool]:
"""Compare each expected field against the actual record."""
results: dict[str, bool] = {}
for key, expected_val in expected.items():
actual_val = actual.get(key)
# Exact match for non-string types
if not isinstance(expected_val, str):
results[key] = actual_val == expected_val
else:
# Fuzzy match for strings (threshold: 0.7)
results[key] = _text_similarity(
_normalize(expected_val), _normalize(actual_val)
) >= 0.7
return results
def score_field_match(
expected_records: list[dict[str, Any]],
actual_records: list[dict[str, Any]],
table: str,
) -> tuple[list[FieldScore], int, int]:
"""Score actual extractions against expected records for one table.
Returns (field_scores, extra_count, missing_count).
"""
field_scores: list[FieldScore] = []
matched_actuals: set[int] = set()
for exp in expected_records:
# Find best match among unmatched actuals
candidates = [
(i, a) for i, a in enumerate(actual_records) if i not in matched_actuals
]
if not candidates:
field_scores.append(FieldScore(
expected=exp, best_match=None, matched_fields={}, similarity=0.0,
))
continue
best_idx, best_match = None, None
best_sim = 0.0
for idx, actual in candidates:
_, sim = _find_best_match(exp, [actual])
if sim > best_sim:
best_sim = sim
best_idx = idx
best_match = actual
if best_sim >= 0.5 and best_match is not None:
matched_actuals.add(best_idx)
matched_fields = _compare_fields(exp, best_match)
field_scores.append(FieldScore(
expected=exp, best_match=best_match,
matched_fields=matched_fields, similarity=best_sim,
))
else:
field_scores.append(FieldScore(
expected=exp, best_match=None, matched_fields={}, similarity=0.0,
))
extra_count = len(actual_records) - len(matched_actuals)
missing_count = sum(1 for s in field_scores if s.best_match is None)
return field_scores, extra_count, missing_count
def compute_precision_recall(
expected_count: int,
actual_count: int,
matched_count: int,
) -> tuple[float, float, float]:
"""Compute precision, recall, F1."""
precision = matched_count / actual_count if actual_count > 0 else 0.0
recall = matched_count / expected_count if expected_count > 0 else 0.0
f1 = (
2 * precision * recall / (precision + recall)
if (precision + recall) > 0
else 0.0
)
return precision, recall, f1
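
Review note: as a worked check of the arithmetic in compute_precision_recall, the sketch below re-implements the same three formulas under a hypothetical name (`precision_recall_f1`) and evaluates them for 4 expected, 5 actual, and 3 matched records. The counts are made up for illustration.

```python
def precision_recall_f1(
    expected_count: int,
    actual_count: int,
    matched_count: int,
) -> tuple[float, float, float]:
    """Same formulas as compute_precision_recall, with zero-safe division."""
    precision = matched_count / actual_count if actual_count > 0 else 0.0
    recall = matched_count / expected_count if expected_count > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# precision = 3/5 = 0.6, recall = 3/4 = 0.75, F1 = 0.9 / 1.35, roughly 0.667
p, r, f1 = precision_recall_f1(expected_count=4, actual_count=5, matched_count=3)

# An empty run (nothing expected, nothing produced) scores 0.0 across the board.
empty = precision_recall_f1(0, 0, 0)
```

Note that an empty run scores 0.0 rather than 1.0, so a fixture that expects no records cannot get a perfect score from these formulas alone.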
# ── LLM Judge Scorer ─────────────────────────────────────────────────────
_JUDGE_SYSTEM_PROMPT = """\
You are an evaluation judge for a data extraction system.
Your task is to compare the EXPECTED extractions against the ACTUAL extractions
produced by an AI agent, and assess quality on a 0-1 scale.
Scoring criteria:
- 1.0: All expected records found with correct fields, no significant extras
- 0.8: Most expected records found, minor field differences or extras
- 0.6: Core extractions present but some missing or incorrect
- 0.4: Partial match — several expected records missing or wrong
- 0.2: Poor quality — most expected records missing or incorrect
- 0.0: Complete failure — no meaningful overlap
Consider semantic equivalence: "Send invoice" and "Email the invoice" are matches.
Ignore field ordering and formatting differences.
Respond with ONLY a JSON object:
{"score": 0.85, "reasoning": "Brief explanation of the score"}
"""
async def llm_judge_score(
expected: list[dict[str, Any]],
actual: list[dict[str, Any]],
*,
judge_model: str = "gpt-4o-mini",
) -> tuple[float, str]:
"""Use an LLM to semantically evaluate extraction quality.
Returns (score, reasoning).
"""
from shared.llm import get_llm
llm = get_llm(model=judge_model, temperature=0)
user_content = (
f"## Expected extractions\n```json\n{json.dumps(expected, indent=2, default=str)}\n```\n\n"
f"## Actual extractions\n```json\n{json.dumps(actual, indent=2, default=str)}\n```"
)
try:
response = await llm.ainvoke([
SystemMessage(content=_JUDGE_SYSTEM_PROMPT),
HumanMessage(content=user_content),
])
raw = response.content.strip()
if raw.startswith("```"):
raw = raw.split("```")[1]
if raw.startswith("json"):
raw = raw[4:]
parsed = json.loads(raw.strip())
return float(parsed.get("score", 0.0)), str(parsed.get("reasoning", ""))
except Exception as exc:
logger.warning("eval: LLM judge failed: %s", exc)
return 0.0, f"Judge error: {exc}"
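
Review note: the reply handling in llm_judge_score (strip an optional code fence, parse JSON, read score and reasoning) can be exercised without calling a model. The sketch below isolates that step in a hypothetical helper, `parse_judge_reply`. The clamp to the 0-1 range is an addition for illustration; the code in this commit returns the score unchanged.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def parse_judge_reply(raw: str) -> tuple[float, str]:
    """Extract (score, reasoning) from a judge reply, fenced or not."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Keep only what sits between the first pair of fences.
        text = text.split(FENCE)[1]
        if text.startswith("json"):
            text = text[4:]
    parsed = json.loads(text.strip())
    score = float(parsed.get("score", 0.0))
    # Assumption: clamp out-of-range scores. Not done in the original code.
    score = max(0.0, min(1.0, score))
    return score, str(parsed.get("reasoning", ""))


fenced_reply = FENCE + 'json\n{"score": 0.85, "reasoning": "ok"}\n' + FENCE
plain_reply = '{"score": 1.7, "reasoning": "too high"}'

fenced_result = parse_judge_reply(fenced_reply)
plain_result = parse_judge_reply(plain_reply)
```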


@@ -14,6 +14,7 @@ langchain-litellm>=0.3.0
 litellm>=1.50.0
 openai>=1.50.0
 httpx>=0.27.0
+langfuse>=3.0.0
 croniter>=2.0.0
 google-api-python-client>=2.130.0
 google-auth>=2.30.0


@@ -1 +0,0 @@
-"""Chat Service domain agents."""


@@ -16,13 +16,13 @@ from typing import Any, Literal
 from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage
 from langchain_core.tools import tool
-from app.agents.note_agent import NOTE_TOOLS
-from app.agents.project_agent import PROJECT_TOOLS
-from app.agents.task_agent import TASK_TOOLS
-from app.agents.timeline_agent import TIMELINE_TOOLS
-from app.llm import get_llm
+from shared.agents.note_agent import NOTE_TOOLS
+from shared.agents.project_agent import PROJECT_TOOLS
+from shared.agents.task_agent import TASK_TOOLS
+from shared.agents.timeline_agent import TIMELINE_TOOLS
+from shared.llm import get_llm
 from app.memory_middleware import MemoryMiddleware
-from app.ws_context import clear_tool_result_collector, execute_on_client, set_tool_result_collector
+from shared.ws_context import clear_tool_result_collector, execute_on_client, set_tool_result_collector
 from app import tracing
 from shared.db import async_session


@@ -33,6 +33,8 @@ def _api_key_for_model(model: str) -> str | None:
         return settings.GOOGLE_API_KEY or None
     if model.startswith("cerebras/"):
         return settings.CEREBRAS_API_KEY or None
+    if model.startswith("github/"):
+        return settings.GITHUB_TOKEN or None
     if model.startswith("github_copilot/"):
         return None
     return settings.OPENAI_API_KEY or None
@@ -49,6 +51,9 @@ def get_llm(
     if settings.GITHUB_COPILOT_TOKEN_DIR:
         os.environ.setdefault("GITHUB_COPILOT_TOKEN_DIR", settings.GITHUB_COPILOT_TOKEN_DIR)
+    if settings.GITHUB_TOKEN:
+        os.environ.setdefault("GITHUB_TOKEN", settings.GITHUB_TOKEN)
     if "/" in model:
         return ChatLiteLLM(model=model, temperature=temperature, callbacks=callbacks)


@@ -17,7 +17,7 @@ from shared.redis import redis_client, ws_out_channel
 from app.deep_agent import run_floating_stream, run_home_stream
 from app.memory_middleware import MemoryMiddleware
 from app.output_formatter import StreamFormatter
-from app.ws_context import clear_current_user, set_current_user
+from shared.ws_context import clear_current_user, set_current_user
 from app import tracing
 logger = logging.getLogger(__name__)


@@ -8,7 +8,7 @@ from fastapi.responses import JSONResponse
 from shared.schemas import ChatRequest
 from app.deep_agent import run_home
-from app.ws_context import clear_current_user, set_current_user
+from shared.ws_context import clear_current_user, set_current_user
 router = APIRouter(prefix="/chat", tags=["chat"])


@@ -167,9 +167,9 @@ def get_prompt(
     fallback: str | None = None,
     cache_ttl_seconds: int = 300,
 ) -> str | None:
-    """Fetch a managed prompt from Langfuse by name.
+    """Fetch a managed prompt from Langfuse by name (without variable compilation).
-    Returns the compiled prompt string, or *fallback* if the prompt is not
+    Returns the raw prompt string, or *fallback* if the prompt is not
     found or Langfuse is disabled.
     """
     lf = _get_client()
@@ -192,6 +192,46 @@ def get_prompt(
         return fallback
+def compile_prompt(
+    name: str,
+    *,
+    fallback: str,
+    variables: dict[str, str],
+    version: int | None = None,
+    label: str | None = None,
+    cache_ttl_seconds: int = 300,
+) -> str:
+    """Fetch a managed prompt from Langfuse and compile it with ``{{variables}}``.
+    If the prompt exists in Langfuse, uses the SDK's ``.compile(**variables)``
+    which replaces ``{{key}}`` placeholders. If Langfuse is disabled or the
+    prompt is not found, falls back to ``fallback.format(**variables)`` (Python
+    ``{key}`` placeholders).
+    This means:
+    - Langfuse prompts use ``{{variable}}`` syntax.
+    - Hardcoded fallback strings use Python ``{variable}`` syntax.
+    """
+    lf = _get_client()
+    if lf is None:
+        return fallback.format(**variables)
+    try:
+        kwargs: dict[str, Any] = {
+            "name": name,
+            "cache_ttl_seconds": cache_ttl_seconds,
+        }
+        if version is not None:
+            kwargs["version"] = version
+        if label is not None:
+            kwargs["label"] = label
+        prompt = lf.get_prompt(**kwargs)
+        return prompt.compile(**variables)
+    except Exception as exc:
+        logger.warning("tracing: compile_prompt(%s) failed, using fallback: %s", name, exc)
+        return fallback.format(**variables)
 def link_prompt_to_trace(
     span: Any,
     prompt_name: str,
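
Review note: compile_prompt relies on two placeholder syntaxes, double braces for managed prompts and single braces for the hardcoded fallback. The sketch below shows how the same prompt has to be written in each style when it also contains literal JSON braces. `compile_double_braces` is a hypothetical helper that only imitates double-brace substitution for this example; it is not the Langfuse SDK's implementation, and the prompt text and variable name are invented.

```python
import re


def compile_double_braces(template: str, variables: dict[str, str]) -> str:
    """Replace {{key}} placeholders; unknown keys are left untouched.

    Hypothetical helper for illustration only.
    """

    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        return variables.get(key, match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


variables = {"file_name": "invoice_042.pdf"}

# Managed-prompt style: {{variable}}, literal JSON braces stay single.
managed = 'Classify {{file_name}} and answer as {"label": "..."}'

# Fallback style for str.format: {variable}, literal braces must be doubled.
fallback = 'Classify {file_name} and answer as {{"label": "..."}}'

from_managed = compile_double_braces(managed, variables)
from_fallback = fallback.format(**variables)
```

Both paths produce the same text here. The practical risk is on the fallback side: a fallback string with undoubled literal braces makes str.format raise, and in compile_prompt that call sits outside the try block when Langfuse is disabled.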


@@ -1,115 +0,0 @@
"""WebSocket context for Chat Service — Redis-based tool call round-trip.
Replaces the monolith's ws_context.py. Instead of calling Electron directly
via WebSocket, this publishes tool_call frames to Redis (ws:out:{user_id})
and awaits the result via BRPOP on tool:result:{call_id}.
"""
from __future__ import annotations
import json
import logging
from contextvars import ContextVar
from typing import Any
from uuid import uuid4
from shared.redis import redis_client, tool_result_key, ws_out_channel
logger = logging.getLogger(__name__)
_TOOL_CALL_TIMEOUT = 30 # seconds — BRPOP timeout
# Per-request user_id context var (set before agent runs)
_current_user_id: ContextVar[str | None] = ContextVar("_current_user_id", default=None)
# Optional collector for debug
_tool_result_collector: ContextVar[list[dict] | None] = ContextVar(
"_tool_result_collector", default=None
)
def set_current_user(user_id: str) -> None:
_current_user_id.set(user_id)
def clear_current_user() -> None:
_current_user_id.set(None)
def set_tool_result_collector(lst: list[dict]) -> None:
_tool_result_collector.set(lst)
def clear_tool_result_collector() -> None:
_tool_result_collector.set(None)
async def execute_on_client(
action: str,
table: str | None = None,
data: dict[str, Any] | None = None,
filters: dict[str, Any] | None = None,
vector: list[float] | None = None,
limit: int | None = None,
) -> dict[str, Any]:
"""Send a tool_call to Electron via Redis and await the result.
1. Build tool_call payload
2. Publish to ws:out:{user_id} (WS Gateway forwards to Electron)
3. BRPOP on tool:result:{call_id} (WS Gateway pushes when Electron replies)
4. Return result dict
Raises RuntimeError if no user_id is set or if the call times out.
"""
user_id = _current_user_id.get()
if not user_id:
raise RuntimeError(
"execute_on_client() called without a user_id — "
"set_current_user() must be called first."
)
call_id = str(uuid4())
payload: dict[str, Any] = {
"type": "tool_call",
"id": call_id,
"action": action,
}
if table is not None:
payload["table"] = table
if data is not None:
payload["data"] = data
if filters is not None:
payload["filters"] = {k: v for k, v in filters.items() if v is not None}
if vector is not None:
payload["vector"] = vector
if limit is not None:
payload["limit"] = limit
# Publish tool_call to WS Gateway → Electron
channel = ws_out_channel(user_id)
await redis_client.publish(channel, json.dumps(payload))
# Wait for Electron's tool_result
result_key = tool_result_key(call_id)
response = await redis_client.brpop(result_key, timeout=_TOOL_CALL_TIMEOUT)
if response is None:
raise RuntimeError(
f"Tool call {call_id} timed out after {_TOOL_CALL_TIMEOUT}s — "
f"device may be offline or unresponsive."
)
# response is (key, value) tuple
_, raw = response
result = json.loads(raw)
# Collect for debug if requested
collector = _tool_result_collector.get(None)
if collector is not None:
collector.append({
"action": action,
"table": table,
"data": result,
})
return result
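
Review note: the publish-then-BRPOP round trip in execute_on_client can be tried without a Redis server. The sketch below uses an in-memory stand-in, `FakeRedis`, built on asyncio queues, plus a `fake_gateway` coroutine that plays the WS Gateway and Electron side. Both are hypothetical and exist only for this example. The key formats are taken from the docstring above; the real `ws_out_channel` and `tool_result_key` helpers are not shown in this diff and may differ.

```python
from __future__ import annotations

import asyncio
import json
from uuid import uuid4


class FakeRedis:
    """In-memory stand-in for the Redis client (illustration only)."""

    def __init__(self) -> None:
        self.published: asyncio.Queue[tuple[str, str]] = asyncio.Queue()
        self.lists: dict[str, asyncio.Queue[str]] = {}

    def _list(self, key: str) -> asyncio.Queue[str]:
        return self.lists.setdefault(key, asyncio.Queue())

    async def publish(self, channel: str, message: str) -> None:
        await self.published.put((channel, message))

    async def lpush(self, key: str, value: str) -> None:
        await self._list(key).put(value)

    async def brpop(self, key: str, timeout: float) -> tuple[str, str] | None:
        try:
            value = await asyncio.wait_for(self._list(key).get(), timeout)
        except asyncio.TimeoutError:
            return None  # same "no reply" signal the original code checks for
        return key, value


async def execute_on_client(
    redis: FakeRedis, user_id: str, action: str, timeout: float = 1.0
) -> dict:
    """Publish a tool_call, then block until the matching result arrives."""
    call_id = str(uuid4())
    payload = {"type": "tool_call", "id": call_id, "action": action}
    await redis.publish(f"ws:out:{user_id}", json.dumps(payload))
    response = await redis.brpop(f"tool:result:{call_id}", timeout)
    if response is None:
        raise RuntimeError(f"Tool call {call_id} timed out after {timeout}s")
    _, raw = response
    return json.loads(raw)


async def fake_gateway(redis: FakeRedis) -> None:
    """Read one published call and push one result under its call id."""
    _, message = await redis.published.get()
    call = json.loads(message)
    reply = {"ok": True, "echo": call["action"]}
    await redis.lpush(f"tool:result:{call['id']}", json.dumps(reply))


async def demo() -> dict:
    redis = FakeRedis()
    gateway = asyncio.create_task(fake_gateway(redis))
    result = await execute_on_client(redis, "user-1", "list")
    await gateway
    return result


result = asyncio.run(demo())
```

Because every call waits on its own per-call result key, concurrent tool calls for the same user do not see each other's replies.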


@@ -0,0 +1 @@
"""Shared domain agents — tool definitions used by both Chat and Batch Agent services."""


@@ -1,6 +1,6 @@
 """Note agent — Markdown note management (list, get, create, update, delete).
-Adapted for Chat Service: import from app.ws_context and app.llm.
+Shared tool definitions used by both Chat and Batch Agent services.
 """
 from __future__ import annotations
@@ -10,8 +10,8 @@ from typing import Any
 from langchain_core.tools import tool
-from app.llm import embed
-from app.ws_context import execute_on_client
+from shared.llm import embed
+from shared.ws_context import execute_on_client
 _UUID_RE = re.compile(
     r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
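
Review note: the `_UUID_RE` pattern visible in this hunk accepts only UUIDs with a version digit of 1 to 5 and a variant digit of 8, 9, a, or b. How the agents use it is not shown in this diff, so the sketch below only demonstrates what the pattern itself accepts and rejects. `looks_like_uuid` is a hypothetical helper name.

```python
import re
import uuid

# Same pattern as _UUID_RE in the diff above.
UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
)


def looks_like_uuid(value: str) -> bool:
    """Return True when value matches the version 1-5 UUID pattern."""
    return UUID_RE.match(value) is not None


checks = {
    # uuid4() always yields version 4 with a variant digit in 8, 9, a, b.
    "uuid4": looks_like_uuid(str(uuid.uuid4())),
    # Plain names, which the system prompts say must not be passed as ids.
    "plain name": looks_like_uuid("Website redesign"),
    # The nil UUID has version digit 0, so the pattern rejects it.
    "nil uuid": looks_like_uuid("00000000-0000-0000-0000-000000000000"),
}
```

The pattern also rejects the newer UUID versions 6 to 8, which matters only if client-generated ids ever move to one of those.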


@@ -1,6 +1,6 @@
 """Project agent — full lifecycle management (list, get, create, update, archive, delete).
-Adapted for Chat Service: import from app.ws_context instead of app.core.ws_context.
+Shared tool definitions used by both Chat and Batch Agent services.
 """
 from __future__ import annotations
@@ -9,7 +9,7 @@ from typing import Any
 from langchain_core.tools import tool
-from app.ws_context import execute_on_client
+from shared.ws_context import execute_on_client
 PROJECT_SYSTEM_PROMPT = (
     "You are a project management assistant. You help users create, find,\n"


@@ -1,6 +1,6 @@
 """Task agent — full CRUD for tasks and task comments.
-Adapted for Chat Service: import from app.ws_context instead of app.core.ws_context.
+Shared tool definitions used by both Chat and Batch Agent services.
 """
 from __future__ import annotations
@@ -11,7 +11,7 @@ from typing import Any
 from langchain_core.tools import tool
-from app.ws_context import execute_on_client
+from shared.ws_context import execute_on_client
 _UUID_RE = re.compile(
     r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
@@ -32,7 +32,6 @@ TASK_SYSTEM_PROMPT = (
     " - project_id is optional; link to a project when the user mentions one\n"
     " - is_ai_suggested: 1 only when proactively proposing a task the user\n"
     " did not explicitly request; 0 otherwise\n"
-    " - is_ai_suggested: 1 only when proactively proposing a task the user did not explicitly request; 0 otherwise\n"
     " - Use list_tasks_due_today for 'what's due today' queries\n"
     " - For update_task, use -1 for integer fields you do not want to change\n"
     " - Always confirm the action in plain, user-friendly language."
@@ -225,7 +224,7 @@ async def delete_task_comment(comment_id: str) -> str:
     return f"Comment {comment_id} deleted."
-# ── Agent ─────────────────────────────────────────────────────────────
+# ── Exports ───────────────────────────────────────────────────────────
 TASK_TOOLS: list[Any] = [


@@ -1,6 +1,6 @@
 """Timeline agent — project milestone management (list, create, update, delete).
-Adapted for Chat Service: import from app.ws_context instead of app.core.ws_context.
+Shared tool definitions used by both Chat and Batch Agent services.
 """
 from __future__ import annotations
@@ -10,7 +10,7 @@ from typing import Any
 from langchain_core.tools import tool
-from app.ws_context import execute_on_client
+from shared.ws_context import execute_on_client
 _UUID_RE = re.compile(
     r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
@@ -28,7 +28,6 @@ TIMELINE_SYSTEM_PROMPT = (
     " - For listing, project_id must be a UUID; never pass plain names as project_id\n"
     " - date is a Unix timestamp in milliseconds; convert human-readable dates\n"
     " - is_ai_suggested: 1 when proactively proposing a timeline, 0 otherwise\n"
-    " - is_ai_suggested: 1 when proactively proposing a timeline, 0 otherwise\n"
     " - For update_timeline, use -1 for integer fields you do not want to change\n"
     " - Listing without a project_id returns all timelines across projects\n"
     " - Always echo the title and formatted date in your confirmation."


@@ -62,6 +62,7 @@ class Settings(BaseSettings):
     ANTHROPIC_API_KEY: str = ""
     GOOGLE_API_KEY: str = ""
     CEREBRAS_API_KEY: str = ""
+    GITHUB_TOKEN: str = ""
     LLM_MODEL: str = "gpt-4o"
     LLM_EMBED_MODEL: str = "text-embedding-3-small"

shared/llm.py

@@ -0,0 +1,77 @@
"""LLM factory — centralised model instantiation via LiteLLM.
Shared by Chat and Batch Agent services.
Uses shared.config.settings for all configuration.
"""
from __future__ import annotations
import os
import warnings
from openai import AsyncOpenAI
import litellm
from langchain_openai import ChatOpenAI
from langchain_litellm import ChatLiteLLM
from shared.config import settings
litellm.drop_params = True
warnings.filterwarnings(
"ignore",
message=r"PydanticSerializationUnexpectedValue\(Expected `ResponseAPIUsage`",
category=UserWarning,
)
def _api_key_for_model(model: str) -> str | None:
if model.startswith("anthropic/"):
return settings.ANTHROPIC_API_KEY or None
if model.startswith("gemini/") or model.startswith("google/"):
return settings.GOOGLE_API_KEY or None
if model.startswith("cerebras/"):
return settings.CEREBRAS_API_KEY or None
if model.startswith("github/"):
return settings.GITHUB_TOKEN or None
if model.startswith("github_copilot/"):
return None
return settings.OPENAI_API_KEY or None
def get_llm(
*,
model: str | None = None,
temperature: float = 0,
callbacks: list | None = None,
) -> ChatOpenAI | ChatLiteLLM:
model = model or settings.LLM_MODEL
if settings.GITHUB_COPILOT_TOKEN_DIR:
os.environ.setdefault("GITHUB_COPILOT_TOKEN_DIR", settings.GITHUB_COPILOT_TOKEN_DIR)
if settings.GITHUB_TOKEN:
os.environ.setdefault("GITHUB_TOKEN", settings.GITHUB_TOKEN)
if "/" in model:
return ChatLiteLLM(model=model, temperature=temperature, callbacks=callbacks)
return ChatOpenAI(
model=model,
temperature=temperature,
api_key=_api_key_for_model(model),
callbacks=callbacks,
)
async def embed(text: str) -> list[float]:
model = settings.LLM_EMBED_MODEL
# Any provider-prefixed model (including "github_copilot/...") contains "/"
if "/" in model:
response = await litellm.aembedding(model=model, input=[text])
return response.data[0]["embedding"]
client = AsyncOpenAI(api_key=settings.OPENAI_API_KEY)
response = await client.embeddings.create(model=model, input=text)
return response.data[0].embedding
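
Review note: the routing in shared/llm.py comes down to two rules: a model name containing "/" goes to ChatLiteLLM, anything else to ChatOpenAI, and the API key is chosen by provider prefix. The sketch below restates both rules as a table-driven lookup that runs without LangChain or LiteLLM. The function names, the dict-based settings, and the model names are hypothetical and used only for illustration.

```python
from __future__ import annotations

# Prefix -> name of the setting that holds the key; None means "pass no key".
_PREFIX_TO_SETTING: list[tuple[str, str | None]] = [
    ("anthropic/", "ANTHROPIC_API_KEY"),
    ("gemini/", "GOOGLE_API_KEY"),
    ("google/", "GOOGLE_API_KEY"),
    ("cerebras/", "CEREBRAS_API_KEY"),
    ("github/", "GITHUB_TOKEN"),
    ("github_copilot/", None),
]


def api_key_for_model(model: str, settings: dict[str, str]) -> str | None:
    """Mirror _api_key_for_model: empty strings are treated as "no key"."""
    for prefix, setting in _PREFIX_TO_SETTING:
        if model.startswith(prefix):
            return (settings.get(setting) or None) if setting else None
    return settings.get("OPENAI_API_KEY") or None


def backend_for_model(model: str) -> str:
    """Mirror get_llm: provider-prefixed names go through LiteLLM."""
    return "ChatLiteLLM" if "/" in model else "ChatOpenAI"


example_settings = {
    "ANTHROPIC_API_KEY": "sk-ant-example",
    "GITHUB_TOKEN": "",
    "OPENAI_API_KEY": "sk-openai-example",
}

anthropic_key = api_key_for_model("anthropic/some-model", example_settings)
github_key = api_key_for_model("github/some-model", example_settings)
default_key = api_key_for_model("gpt-4o", example_settings)
```

One observation from the code as shown in this diff: get_llm returns ChatLiteLLM before `_api_key_for_model` is called for any name containing "/", and every prefix in that function contains "/". Within get_llm the lookup therefore only ever yields the OpenAI key, and provider keys for LiteLLM models appear to reach LiteLLM some other way, for example through environment variables such as the GITHUB_TOKEN default set here.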


@@ -1,12 +1,12 @@
-"""WebSocket context for Batch Agent Service — Redis-based tool call round-trip.
+"""WebSocket context — Redis-based tool call round-trip.
-Same pattern as services/chat/app/ws_context.py: publishes tool_call frames
-to Redis ws:out:{user_id} and awaits BRPOP on tool:result:{call_id}.
+Shared by Chat and Batch Agent services. Publishes tool_call frames to
+Redis ``ws:out:{user_id}`` and awaits the result via BRPOP on
+``tool:result:{call_id}``.
-Additionally provides set_client_executor / clear_client_executor stubs
-for backward compatibility with the agent_runner code (which originally
-used a DeviceConnectionManager callback). In the microservice world these
-are no-ops execute_on_client() always uses the Redis path.
+Also provides ``set_client_executor`` / ``clear_client_executor`` no-op
+shims for backward compatibility with agent_runner code that originally
+used a DeviceConnectionManager callback.
 """
 from __future__ import annotations
@@ -23,10 +23,10 @@ logger = logging.getLogger(__name__)
 _TOOL_CALL_TIMEOUT = 30 # seconds — BRPOP timeout
-# Per-request user_id context var (set before agent run)
+# Per-request user_id context var (set before agent runs)
 _current_user_id: ContextVar[str | None] = ContextVar("_current_user_id", default=None)
-# Optional collector for debug / logging
+# Optional collector for debug
 _tool_result_collector: ContextVar[list[dict] | None] = ContextVar(
     "_tool_result_collector", default=None
 )
@@ -51,17 +51,14 @@ def clear_tool_result_collector() -> None:
 # ── Compatibility shims ──────────────────────────────────────────────────
 # agent_runner.py originally called set_client_executor / clear_client_executor
 # with a DeviceConnectionManager callback. In the microservice world the
-# Redis-based execute_on_client replaces this, so these are no-ops that
-# keep the agent_runner code unchanged.
+# Redis-based execute_on_client replaces this, so these are no-ops.
 def set_client_executor(fn: Callable[[dict], Coroutine[Any, Any, dict]] | None) -> None:
     """No-op — kept for agent_runner compatibility."""
-    pass
 def clear_client_executor() -> None:
     """No-op — kept for agent_runner compatibility."""
-    pass
 async def execute_on_client(