Roberto 732a4c42f8 docs: add scouts refactor + gmail scout design spec

Phases 1-3 in scope: rename agents → scouts (UI/code/Postgres/SQLite/
Langfuse), Gmail cloud scout w/ two-stage pipeline, SourceConnector
abstraction. Phase 4 (Stage 2 categorization + HITL surface in brief)
deferred to task-brief rework.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-15 23:15:46 +02:00

Scouts Refactor + Gmail Integration — Design

Date: 2026-05-15
Status: Draft, awaiting user review
Owner: Roberto

Summary

Rename the existing "Agents" subsystem to "Scouts" across the entire stack (UI, code, Postgres, SQLite, Langfuse), then add the first cloud scout — Gmail — using a two-stage pipeline that respects zero-trust (no email content stored on backend) and human-in-the-loop (no entities created autonomously).

The implementation is split into four phases. Phases 1–3 ship now. Phase 4 (Stage 2 categorization, HITL surface in the brief, conversion-to-entity mutations) is deferred to the planned task-brief rework.

Goals

Unify the user-facing "data source watchers" concept under one name: Scout.
Land a SourceConnector abstraction so future cloud scouts (Slack/Teams/Outlook/RSS/...) reuse the same engine, queue, delivery channel, and HITL surface — only the per-source connector is new.
Ship a Gmail scout end-to-end with: OAuth, push (users.watch) + cron-fallback polling, BE-side spam triage, encrypted token storage, opt-in spam auto-trash.
Preserve zero-trust: Gmail bodies are fetched transiently for the triage LLM call and discarded; only {message_id, scout_id, verdict, status} is persisted on BE.
Preserve HITL on the cloud path: scouts never create tasks/projects/events/notes autonomously; they accumulate proposals that the user resolves later from the brief.

Non-Goals (Phase 4, separate spec)

Stage 2 categorization agent prompt + tool palette.
HITL UI in the task brief (suggestion cards, approve/reject controls, convert-to-entity mutations, list_pending_scout_suggestions brief tool).
Local scout behavior change. Local directory monitor keeps current "auto-create" semantics. HITL is opt-in for local scouts in a future migration.
Schema unification of LocalScoutConfig + CloudScoutConfig. They have different behaviors; keep separate tables.
Connectors other than Gmail (Slack/Teams/Outlook).
Stripe/billing changes (existing tier checks suffice).

Constraints

Pre-1.0 dev: no production users, no backwards-compatibility shims, no Alembic data migrations beyond rename. Drop-and-recreate is acceptable where simpler.
Zero-trust: BE never persists user content. Gmail bodies are read transiently for the triage LLM call only.
HITL (cloud path): scouts produce proposals, never entities.
Spam auto-trash: off by default per scout; opt-in via UI toggle. Action is "move to Trash" (recoverable for 30 days under Gmail's Trash retention), never permanent delete.
Reusability: cloud-scout pipeline (connector → triage → queue → deliver-on-connect → HITL) is shared infra; Gmail is just the first connector.

Architecture

Two-stage pipeline (cloud scouts only)

[Gmail] --push/cron--> [BE Stage 1: Triage]                  [Electron Stage 2: Categorize]
                          |                                     |
                          v                                     v
                       fetch body (transient)               drain queue on WS reconnect
                          |                                     |
                          v                                     v
                       LLM relevance call                   fetch metadata for each msg
                          |                                     |
                          +-- spam + auto_trash_spam: archive   v
                          |                                  insert scout_suggestions row
                          +-- relevant: insert queue row     (category='unprocessed' stub
                                                              until Phase 4)

Stage 1 (BE, always-on): verdict only. Stores {msg_id, verdict, status}. No content.

Stage 2 (Electron, on connect): Phase 3 ships a stub that simply mirrors the queue into a local SQLite table with category='unprocessed'. Phase 4 swaps in the real categorization agent.

Local scouts (unchanged behaviorally)

Local directory monitor keeps current Electron-side scheduling and auto-creation. Only renames apply.

SourceConnector abstraction

A SourceConnector Protocol owns all source-specific I/O. The shared ScoutEngine owns triage, queueing, delivery, and ack handling. To add a new cloud scout: implement one connector class + register it.

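# app/scouts/connectors/base.py
from __future__ import annotations  # CloudScoutConfig and the Item* types are defined elsewhere

from typing import Protocol

class SourceConnector(Protocol):
    source_type: str  # "gmail"

    # Refs for items that arrived since the last run.
    async def list_new(self, scout: CloudScoutConfig) -> list[ItemRef]: ...
    # Subject + snippet only; used at delivery time.
    async def fetch_metadata(self, scout: CloudScoutConfig, ref: ItemRef) -> ItemMetadata: ...
    # Full body, held in memory only and never persisted.
    async def fetch_content(self, scout: CloudScoutConfig, ref: ItemRef) -> ItemContent: ...
    # Spam auto-trash hook; for Gmail this moves the message to Trash.
    async def archive(self, scout: CloudScoutConfig, ref: ItemRef) -> None: ...
    # Register, then periodically refresh, the source's push subscription.
    async def setup_watch(self, scout: CloudScoutConfig) -> None: ...
    async def renew_watch(self, scout: CloudScoutConfig) -> None: ...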

ItemContent.body_text is in-memory only; never persisted.
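
The "implement one connector class + register it" step could look roughly like the sketch below. The spec only names app/scouts/connectors/registry.py, so register, get_connector, the ItemRef shape, and FakeGmailConnector are all hypothetical names chosen for illustration; a real connector would call the Gmail API instead of reading a dict.

```python
from dataclasses import dataclass

# Hypothetical registry: source_type -> connector instance.
_CONNECTORS: dict[str, object] = {}


def register(connector) -> None:
    """Index a connector instance by its source_type."""
    if connector.source_type in _CONNECTORS:
        raise ValueError(f"duplicate connector: {connector.source_type}")
    _CONNECTORS[connector.source_type] = connector


def get_connector(source_type: str):
    """Look up the connector the engine should use for a scout."""
    try:
        return _CONNECTORS[source_type]
    except KeyError:
        raise LookupError(f"no connector registered for {source_type!r}") from None


@dataclass(frozen=True)
class ItemRef:
    """Assumed shape: an opaque per-source item id."""
    source_msg_ref: str


class FakeGmailConnector:
    """In-memory stand-in that implements only two Protocol methods."""
    source_type = "gmail"

    def __init__(self, inbox: dict[str, str]):
        self._inbox = inbox  # message id -> body, kept in memory only

    async def list_new(self, scout) -> list[ItemRef]:
        return [ItemRef(mid) for mid in self._inbox]

    async def fetch_content(self, scout, ref: ItemRef) -> str:
        return self._inbox[ref.source_msg_ref]
```

With this shape the engine never imports a concrete connector; it resolves one from the scout's source_type at run time.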

ScoutEngine

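from __future__ import annotations  # DeviceWS is defined elsewhere
from uuid import UUID

class ScoutEngine:
    async def trigger_scout(self, scout_id: UUID) -> None: ...  # webhook + cron entry point
    async def _process_item(self, scout, connector, ref) -> None: ...  # triage one item, queue if relevant
    async def deliver_pending(self, user_id: UUID, ws: DeviceWS) -> None: ...  # drain queue over the device WS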

Both webhook and cron-fallback entry points call trigger_scout.

Data Model

Postgres (BE)

Renames (Phase 1, single Alembic migration)

Before	After
Table `local_agent_configs`	`local_scout_configs`
Table `cloud_agent_configs`	`cloud_scout_configs`
Table `agent_run_logs`	`scout_run_logs`
Column `agent_config`	`scout_config`
Column `agent_id` (FKs)	`scout_id`
Column `agent_run_id`	`scout_run_id`
Class `LocalAgentConfig`	`LocalScoutConfig`
Class `CloudAgentConfig`	`CloudScoutConfig`
Class `AgentRunLog`	`ScoutRunLog`

New (Phase 2)

CREATE TABLE scout_triage_queue (
  id              uuid PRIMARY KEY,
  user_id         uuid NOT NULL REFERENCES users(id),
  scout_id        uuid NOT NULL REFERENCES cloud_scout_configs(id),
  source_type     text NOT NULL,                   -- "gmail"
  source_msg_ref  text NOT NULL,                   -- gmail message id
  triage_verdict  text NOT NULL,                   -- "relevant"
  triage_reason   text,                            -- short LLM reason for debug
  status          text NOT NULL DEFAULT 'queued',  -- queued | delivered | acked | expired
  triaged_at      timestamptz NOT NULL DEFAULT now(),
  delivered_at    timestamptz,
  acked_at        timestamptz,
  expires_at      timestamptz NOT NULL,            -- triaged_at + 30d
  UNIQUE (scout_id, source_msg_ref)                -- idempotent webhook retries
);
CREATE INDEX ON scout_triage_queue (user_id, status);
CREATE INDEX ON scout_triage_queue (expires_at) WHERE status != 'acked';
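
The UNIQUE (scout_id, source_msg_ref) constraint is what makes webhook retries harmless. A minimal sketch of an idempotent enqueue follows, using the stdlib sqlite3 module as a stand-in for Postgres so it runs anywhere. The enqueue and make_db helpers are hypothetical, and the table is trimmed to the columns the insert touches; the ON CONFLICT ... DO NOTHING clause has the same form in Postgres.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta

QUEUE_TTL = timedelta(days=30)  # expires_at = triaged_at + 30d, per the column comment


def make_db() -> sqlite3.Connection:
    """SQLite stand-in for the Postgres table (types simplified to TEXT)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE scout_triage_queue (
            id             TEXT PRIMARY KEY,
            user_id        TEXT NOT NULL,
            scout_id       TEXT NOT NULL,
            source_type    TEXT NOT NULL,
            source_msg_ref TEXT NOT NULL,
            triage_verdict TEXT NOT NULL,
            triage_reason  TEXT,
            status         TEXT NOT NULL DEFAULT 'queued',
            triaged_at     TEXT NOT NULL,
            expires_at     TEXT NOT NULL,
            UNIQUE (scout_id, source_msg_ref)
        )
        """
    )
    return db


def enqueue(db: sqlite3.Connection, user_id: str, scout_id: str,
            msg_ref: str, reason: str, now: datetime) -> bool:
    """Insert a 'relevant' verdict; return False if the row already existed."""
    cur = db.execute(
        """
        INSERT INTO scout_triage_queue
            (id, user_id, scout_id, source_type, source_msg_ref,
             triage_verdict, triage_reason, triaged_at, expires_at)
        VALUES (?, ?, ?, 'gmail', ?, 'relevant', ?, ?, ?)
        ON CONFLICT (scout_id, source_msg_ref) DO NOTHING
        """,
        (str(uuid.uuid4()), user_id, scout_id, msg_ref, reason,
         now.isoformat(), (now + QUEUE_TTL).isoformat()),
    )
    return cur.rowcount == 1  # 0 rows changed means the retry was absorbed
```

Note that the row carries only identifiers, the verdict, and a short reason; no subject, snippet, or body column exists to write content into.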

Alterations to `cloud_scout_configs` (Phase 2)

ALTER TABLE cloud_scout_configs ADD COLUMN auto_trash_spam boolean NOT NULL DEFAULT false;
ALTER TABLE cloud_scout_configs ADD COLUMN gmail_history_id text;
ALTER TABLE cloud_scout_configs ADD COLUMN gmail_watch_expires_at timestamptz;
ALTER TABLE cloud_scout_configs ADD COLUMN device_inactivity_pause_days int NOT NULL DEFAULT 14;

OAuth tokens continue to live in the existing cloud_scout_configs.oauth_token_encrypted column. Encryption mechanism (key derivation, rotation) is reused unchanged. A pre-implementation investigation step will document the current key-management story so we know the threat model; hardening, if needed, is out of scope.

SQLite (Electron, Drizzle)

Renames (Phase 1)

Before	After
`agent_runs`	`scout_runs`
`agent_run_actions`	`scout_run_actions`
Col `agent_id`	`scout_id`

New (Phase 2)

import { integer, sqliteTable, text } from 'drizzle-orm/sqlite-core';

export const scoutSuggestions = sqliteTable('scout_suggestions', {
  id:                  text().primaryKey(),
  scoutId:             text().notNull(),
  sourceType:          text().notNull(),     // "gmail"
  sourceMsgRef:        text().notNull(),
  category:            text().notNull(),     // "unprocessed" until Phase 4
  payload:             text(),               // JSON, populated by Phase 4
  rawSubject:          text(),               // populated on delivery
  rawSnippet:          text(),               // populated on delivery
  status:              text().notNull(),     // pending | approved | rejected | expired
  proposedAt:          integer().notNull(),  // ms epoch
  resolvedAt:          integer(),
  resolvedEntityType:  text(),               // "task" | "project" | ... after Phase 4 approval
  resolvedEntityId:    text(),
});

rawSubject + rawSnippet are stored locally to render the HITL card without re-hitting Gmail every render. Body is still NOT stored — fetched on-demand via a tool call when the user explicitly opens the suggestion.

WebSocket Frame Contract

Existing /api/v1/device channel. Two new frame types.

// BE → Electron
{
  type: 'scout_proposal',
  proposal: {
    id: string,
    scoutId: string,
    sourceType: 'gmail',
    sourceMsgRef: string,
    rawSubject: string | null,
    rawSnippet: string | null,
    category: 'unprocessed',
    payload: null
  }
}

// Electron → BE
{ type: 'scout_proposal_ack', proposalId: string }

On WS reconnect, BE's ScoutEngine.deliver_pending(user_id, ws) selects all status='queued' rows for the user, calls connector.fetch_metadata per row (subject + snippet only), sends one scout_proposal frame each, and flips status='delivered' + sets delivered_at upon ack.
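
The delivery loop described above can be sketched with in-memory stand-ins. QueueRow, FakeWS, send_json, and handle_ack are hypothetical names, and fetch_metadata here fabricates a subject instead of calling Gmail. The status flip follows the sentence above literally (delivered, with delivered_at, once the ack arrives); it is an illustration of the frame contract, not the engine's actual code.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class QueueRow:
    """In-memory stand-in for a scout_triage_queue row."""
    id: str
    scout_id: str
    source_msg_ref: str
    status: str = "queued"
    delivered_at: Optional[datetime] = None


class FakeWS:
    """Hypothetical device socket that records every frame it is asked to send."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    async def send_json(self, frame: dict) -> None:
        self.sent.append(frame)


async def fetch_metadata(row: QueueRow) -> dict:
    """Stand-in for connector.fetch_metadata: subject + snippet only, no body."""
    return {"subject": f"Subject of {row.source_msg_ref}", "snippet": "..."}


async def deliver_pending(rows: list[QueueRow], ws: FakeWS) -> None:
    """Send one scout_proposal frame per queued row."""
    for row in rows:
        if row.status != "queued":
            continue
        meta = await fetch_metadata(row)
        await ws.send_json({
            "type": "scout_proposal",
            "proposal": {
                "id": row.id,
                "scoutId": row.scout_id,
                "sourceType": "gmail",
                "sourceMsgRef": row.source_msg_ref,
                "rawSubject": meta["subject"],
                "rawSnippet": meta["snippet"],
                "category": "unprocessed",
                "payload": None,
            },
        })


def handle_ack(rows: list[QueueRow], frame: dict, now: datetime) -> bool:
    """Apply a scout_proposal_ack frame; return False for unknown proposal ids."""
    for row in rows:
        if row.id == frame["proposalId"]:
            row.status = "delivered"
            row.delivered_at = now
            return True
    return False
```

Subject and snippet exist only inside the outgoing frame; the queue row itself is never widened to hold them.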

Stage 1 Triage Detail

Webhook (Pub/Sub) or cron tick
  -> ScoutEngine.trigger_scout(scout_id)
  -> if device inactive > device_inactivity_pause_days: skip (pause)
  -> connector.list_new(scout) -> [ItemRef]
  -> for each ref:
       - if (scout_id, source_msg_ref) already in queue: skip (idempotent)
       - content = await connector.fetch_content(scout, ref)         # transient
       - verdict = await ScoutEngine._triage_llm(scout, content)     # gpt-4o-mini
       - if verdict == spam:
           - if scout.auto_trash_spam: connector.archive(...)
           - continue                                                # not queued
       - INSERT scout_triage_queue row
  -> UPDATE cloud_scout_configs.last_run_at
  -> INSERT scout_run_logs row
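
The per-item branch of the flow above, as a runnable sketch. Everything here is a stand-in: Scout is an assumed subset of CloudScoutConfig, FakeConnector replaces the Gmail connector, a dict replaces scout_triage_queue, and triage is a keyword check standing in for the LLM call. The device-inactivity check and run logging are left out.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Scout:
    """Assumed subset of CloudScoutConfig."""
    id: str
    auto_trash_spam: bool = False


@dataclass
class FakeConnector:
    """In-memory connector; bodies live only in this dict, never in the queue."""
    inbox: dict[str, str]
    trashed: list[str] = field(default_factory=list)

    async def list_new(self, scout: Scout) -> list[str]:
        return list(self.inbox)

    async def fetch_content(self, scout: Scout, ref: str) -> str:
        return self.inbox[ref]

    async def archive(self, scout: Scout, ref: str) -> None:
        self.trashed.append(ref)


async def triage(body: str) -> str:
    """Keyword stand-in for the LLM call; returns 'spam' or 'relevant'."""
    return "spam" if "lottery" in body.lower() else "relevant"


async def trigger_scout(scout: Scout, connector: FakeConnector,
                        queue: dict[tuple[str, str], dict]) -> None:
    for ref in await connector.list_new(scout):
        key = (scout.id, ref)
        if key in queue:                    # idempotent: already triaged
            continue
        body = await connector.fetch_content(scout, ref)  # transient
        verdict = await triage(body)
        del body                            # content is dropped before any write
        if verdict == "spam":
            if scout.auto_trash_spam:
                await connector.archive(scout, ref)
            continue                        # spam is never queued
        queue[key] = {"verdict": verdict, "status": "queued"}
```

The spam branch moves on to the next ref instead of ending the run, so one spam message does not stop the rest of the batch from being triaged.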

Triage LLM contract

Prompt name (Langfuse): scout-triage-system — source-agnostic, parameterized by source_type.
Input: {source_type, scout_name, scout_purpose, item_subject, item_sender, item_body_truncated_2k}.
Output (structured, Pydantic TriageVerdict): {verdict: "relevant" | "spam", reason: str, confidence: float}.
Cost guard: body truncated at 2k chars before LLM call.
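
One way to picture the contract above is as a pair of helpers: one assembles the prompt variables with the body cut before the call, the other validates the structured reply. This sketch uses a plain dataclass as a dependency-free stand-in for the Pydantic TriageVerdict. build_triage_input and parse_verdict are hypothetical names, and both the 2000-character reading of "2k" and the 0..1 confidence range are assumptions.

```python
import json
from dataclasses import dataclass

BODY_LIMIT = 2000  # assumed reading of the "2k chars" cost guard


@dataclass(frozen=True)
class TriageVerdict:
    """Stand-in for the Pydantic model named in the contract."""
    verdict: str       # "relevant" | "spam"
    reason: str
    confidence: float


def build_triage_input(source_type: str, scout_name: str, scout_purpose: str,
                       subject: str, sender: str, body: str) -> dict:
    """Assemble the prompt variables, truncating the body before the LLM sees it."""
    return {
        "source_type": source_type,
        "scout_name": scout_name,
        "scout_purpose": scout_purpose,
        "item_subject": subject,
        "item_sender": sender,
        "item_body_truncated_2k": body[:BODY_LIMIT],
    }


def parse_verdict(raw: str) -> TriageVerdict:
    """Validate the model's JSON reply; raise ValueError on anything off-contract."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    verdict = data.get("verdict")
    if verdict not in ("relevant", "spam"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:  # range is an assumption
        raise ValueError(f"confidence out of range: {confidence}")
    return TriageVerdict(verdict, str(data.get("reason", "")), confidence)
```

Rejecting off-contract replies with an exception lets the caller treat them like any other failed LLM call and retry on the next trigger.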

Failure modes

LLM call fails: log error, leave message unprocessed, retry on next webhook/cron.
Gmail 401 (refresh exhausted): mark scout enabled=false, surface re-auth prompt to user via WS frame on next device connect.
Pub/Sub webhook request with a missing or invalid JWT: respond 401 and do not enqueue a triage job.

Gmail Push Setup

On scout enable: GmailConnector.setup_watch(scout) calls users.watch against a single project-wide Pub/Sub topic.
gmail_watch_expires_at stored. Watches expire after 7 days.
Daily cron _scout_watch_renewal_tick re-issues watch for any scout whose expiry is within 24h. The tick must run at least as often as that 24h window, or a 7-day watch can lapse between runs.
Webhook route: POST /api/v1/scouts/webhooks/gmail. Verifies Pub/Sub-signed JWT, resolves user via the email address in the payload, enqueues triage job.
Cron fallback (_scout_cron_tick, runs each scout's schedule_cron): polls users.history.list since gmail_history_id, updates gmail_history_id after.
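
Two small pieces of the push path can be sketched in isolation: unpacking the Pub/Sub push envelope and deciding whether a watch is due for renewal. Gmail notifications arrive as a base64-encoded JSON object in message.data carrying emailAddress and historyId. The function names below are hypothetical, and JWT verification is assumed to have already succeeded before decode_gmail_push is called.

```python
import base64
import json
from datetime import datetime, timedelta

RENEWAL_WINDOW = timedelta(hours=24)  # "expiry is within 24h"


def decode_gmail_push(envelope: dict) -> tuple[str, str]:
    """Return (email_address, history_id) from a Pub/Sub push envelope."""
    data = envelope["message"]["data"]
    payload = json.loads(base64.urlsafe_b64decode(data))
    # historyId is normalized to str to match the gmail_history_id text column.
    return payload["emailAddress"], str(payload["historyId"])


def needs_renewal(expires_at: datetime, now: datetime) -> bool:
    """True when the watch has expired or will expire within the renewal window."""
    return expires_at - now <= RENEWAL_WINDOW
```

The email address resolves the user and scout; the history id is only a hint that something changed, so the engine still lists changes since its stored gmail_history_id.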

Terminology Refactor (Detail)

Renamed

Surface	Before	After
Settings nav	`settings.agents` "Agents"	`settings.scouts` "Scouts"
Subtitle/desc	`settings.agentsSubtitle`, `agentsDescription`	`settings.scoutsSubtitle`, `scoutsDescription`
`agents.*` keys	`noAgentsYet`, `createAgent`, `yourAgents`, etc.	`scouts.noScoutsYet`, `createScout`, `yourScouts`
`toast.agent.*`	`created`, `runStarted`, etc.	`toast.scout.*`
Components	`AgentsSection`, `AgentRow`, `LocalAgentConfigPanel`, `CloudAgentConfigPanel`, `InlineAgentCreationStepper`	`ScoutsSection`, `ScoutRow`, `LocalScoutConfigPanel`, `CloudScoutConfigPanel`, `InlineScoutCreationStepper`
TS types	`LocalAgentConfig`, `CloudAgentConfig`	`LocalScoutConfig`, `CloudScoutConfig`
tRPC router	`agent.local`, `agent.cloud`, `agent.journey`, `agent.runs`, `agent.runActions`	`scout.local`, `scout.cloud`, `scout.journey`, `scout.runs`, `scout.runActions`
Drizzle tables	`agent_runs`, `agent_run_actions`	`scout_runs`, `scout_run_actions`
Main process	`src/main/agents/agent-scheduler.ts`	`src/main/scouts/scout-scheduler.ts`
BE routes	`/api/v1/agents/*`, `/api/v1/agent-setup`	`/api/v1/scouts/*`, `/api/v1/scout-setup`
BE modules	`routes/agents.py`, `routes/agent_setup.py`, `core/agent_runner.py`, `core/agent_session_buffer.py`, `core/agent_registry.py`	`routes/scouts.py`, `routes/scout_setup.py`, `core/scout_runner.py`, `core/scout_session_buffer.py`, `core/scout_registry.py`
Postgres tables	`local_agent_configs`, `cloud_agent_configs`, `agent_run_logs`	`local_scout_configs`, `cloud_scout_configs`, `scout_run_logs`
Postgres columns	`agent_config`, `agent_id`, `agent_run_id`	`scout_config`, `scout_id`, `scout_run_id`
SQLAlchemy models	`LocalAgentConfig`, `CloudAgentConfig`, `AgentRunLog`	`LocalScoutConfig`, `CloudScoutConfig`, `ScoutRunLog`
Langfuse prompts	user-facing scout prompts named `agent-*`	recreate as `scout-*`; delete old
i18n	5 langs (en/it/es/fr/de)	all updated atomically

Kept as-is

app/agents/* Python module — these are LLM helper agents (task_agent, project_agent, note_agent, timeline_agent, filesystem_agent) invoked internally by deep_agent. Different concept from user-facing scouts. Renaming would create semantic clash with LLM-agent terminology.
/api/v1/device WS endpoint name (already source-neutral).
All tool_call, run_complete, etc. WS frame types unrelated to scouts.

Phasing

Phase 1 — Rename only

Single PR. Single Alembic migration. Single Drizzle migration.
All renames listed above land together. App still works, existing local scout still runs. No new behavior.

Phase 2 — Connector abstraction skeleton

New module app/scouts/connectors/{base,registry,gmail}.py.
New module app/scouts/engine.py.
New table scout_triage_queue + alterations to cloud_scout_configs.
New SQLite table scout_suggestions (Drizzle).
New WS frame types scout_proposal + scout_proposal_ack.
No user-facing change yet.

Phase 3 — Gmail scout end-to-end

Settings UI: "Add Gmail scout" → OAuth consent (separate scope set: gmail.readonly + gmail.modify) → encrypted token stored in cloud_scout_configs.oauth_token_encrypted → save scout config.
Pub/Sub topic + webhook route + JWT verify.
setup_watch on enable; daily renew_watch cron.
Cron-fallback _scout_cron_tick per scout.
Triage LLM (gpt-4o-mini, Langfuse scout-triage-system).
Spam auto-trash toggle (default off) per scout.
Device-inactivity pause logic.
WS deliver-on-reconnect drains queue → scout_proposal frames → ack handler → SQLite scout_suggestions insert with category='unprocessed' (Phase 4 swaps real categorization in).
"Read full email" tool call: Electron requests body for a suggestion → BE GmailConnector.fetch_content → returns body transiently in tool result.

Phase 4 — Deferred (separate spec, with task-brief rework)

Stage 2 categorization agent (prompt + tool palette: list_projects, list_tasks, search_notes, memory).
HITL UI surface in the brief: suggestion cards, approve/reject controls, "convert to task | event | note | project | actionable-only" actions.
list_pending_scout_suggestions brief tool.
Convert-to-entity mutations.
Future connectors (Slack/Teams/Outlook/...).

Testing Surface

Phase 1: existing pytest suite still green with renamed identifiers (auth, ws_unified, schemas, models, etc.). UI smoke: settings page renders, existing local scout runs.
Phase 2: unit tests for ScoutEngine w/ mocked SourceConnector. Idempotency test (replay same source_msg_ref).
Phase 3: integration tests for Gmail webhook → triage → queue insertion (GmailConnector content fetch and the triage LLM both mocked). E2E (manual): connect a real Gmail account on dev, send an email, observe queue row appear, reconnect device, observe scout_suggestions row land with subject/snippet.

Open Questions (none blocking)

OAuth-token encryption key derivation (app-global vs user-derived): investigation step in implementation plan to document the current state. Security hardening is out of scope.
Pub/Sub topic naming and IAM setup (one topic project-wide vs per-environment) — operational detail to decide during Phase 3.

Risks

Pub/Sub setup is per-Google-Cloud-project and requires console IAM grants — first-time setup friction.
Gmail users.watch quota: 1 watch per user. We use one watch per scout, but a user has only one Gmail scout per Gmail account, so this is fine.
_pending_states dict pattern in existing OAuth flow is in-memory — Pub/Sub webhook can run on any worker, so any cross-request state must be in DB, not in-memory. This design uses no in-memory state; safe.

Acceptance

All renames land atomically; app boots; existing local scout still operates.
A user can connect Gmail through the Scouts settings page, see the scout marked enabled, send themselves a test email, and observe a scout_suggestions row appear in their local DB with category='unprocessed', rawSubject, and rawSnippet populated, after the next WS reconnect.
Spam emails (per LLM triage) are not queued; if auto_trash_spam=true they appear in Gmail Trash.
BE never persists email bodies. Verified by code review of triage flow + grep for body_text writes.
