workspace/docs/superpowers/specs/2026-05-15-scouts-refactor-and-gmail-integration-design.md
Roberto 732a4c42f8 docs: add scouts refactor + gmail scout design spec
Phases 1-3 in scope: rename agents → scouts (UI/code/Postgres/SQLite/
Langfuse), Gmail cloud scout w/ two-stage pipeline, SourceConnector
abstraction. Phase 4 (Stage 2 categorization + HITL surface in brief)
deferred to task-brief rework.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 23:15:46 +02:00


Scouts Refactor + Gmail Integration — Design

Date: 2026-05-15
Status: Draft, awaiting user review
Owner: Roberto

Summary

Rename the existing "Agents" subsystem to "Scouts" across the entire stack (UI, code, Postgres, SQLite, Langfuse), then add the first cloud scout — Gmail — using a two-stage pipeline that respects zero-trust (no email content stored on backend) and human-in-the-loop (no entities created autonomously).

The implementation is split into four phases. Phases 1-3 ship now. Phase 4 (Stage 2 categorization, HITL surface in the brief, conversion-to-entity mutations) is deferred to the planned task-brief rework.

Goals

  • Unify the user-facing "data source watchers" concept under one name: Scout.
  • Land a SourceConnector abstraction so future cloud scouts (Slack/Teams/Outlook/RSS/...) reuse the same engine, queue, delivery channel, and HITL surface — only the per-source connector is new.
  • Ship a Gmail scout end-to-end with: OAuth, push (users.watch) + cron-fallback polling, BE-side spam triage, encrypted token storage, opt-in spam auto-trash.
  • Preserve zero-trust: Gmail bodies are fetched transiently for the triage LLM call and discarded; only {message_id, scout_id, verdict, status} is persisted on BE.
  • Preserve HITL on the cloud path: scouts never create tasks/projects/events/notes autonomously; they accumulate proposals that the user resolves later from the brief.

Non-Goals (Phase 4, separate spec)

  • Stage 2 categorization agent prompt + tool palette.
  • HITL UI in the task brief (suggestion cards, approve/reject controls, convert-to-entity mutations, list_pending_scout_suggestions brief tool).
  • Local scout behavior change. Local directory monitor keeps current "auto-create" semantics. HITL is opt-in for local scouts in a future migration.
  • Schema unification of LocalScoutConfig + CloudScoutConfig. They have different behaviors; keep separate tables.
  • Connectors other than Gmail (Slack/Teams/Outlook).
  • Stripe/billing changes (existing tier checks suffice).

Constraints

  • Pre-1.0 dev: no production users, no backwards-compatibility shims, no Alembic data migrations beyond rename. Drop-and-recreate is acceptable where simpler.
  • Zero-trust: BE never persists user content. Gmail bodies are read transiently for the triage LLM call only.
  • HITL (cloud path): scouts produce proposals, never entities.
  • Spam auto-trash: off by default per scout; opt-in via UI toggle. Action is "move to Trash" (Gmail's 30d recovery), never permanent delete.
  • Reusability: cloud-scout pipeline (connector → triage → queue → deliver-on-connect → HITL) is shared infra; Gmail is just the first connector.

Architecture

Two-stage pipeline (cloud scouts only)

[Gmail] --push/cron--> [BE Stage 1: Triage]                  [Electron Stage 2: Categorize]
                          |                                     |
                          v                                     v
                       fetch body (transient)               drain queue on WS reconnect
                          |                                     |
                          v                                     v
                       LLM relevance call                   fetch metadata for each msg
                          |                                     |
                          +-- spam + auto_trash_spam: archive   v
                          |                                  insert scout_suggestions row
                          +-- relevant: insert queue row     (category='unprocessed' stub
                                                              until Phase 4)

Stage 1 (BE, always-on): verdict only. Stores {msg_id, verdict, status}. No content.

Stage 2 (Electron, on connect): Phase 3 ships a stub that simply mirrors the queue into a local SQLite table with category='unprocessed'. Phase 4 swaps in the real categorization agent.

Local scouts (unchanged behaviorally)

Local directory monitor keeps current Electron-side scheduling and auto-creation. Only renames apply.

SourceConnector abstraction

A SourceConnector Protocol owns all source-specific I/O. The shared ScoutEngine owns triage, queueing, delivery, and ack handling. To add a new cloud scout: implement one connector class + register it.

# app/scouts/connectors/base.py
from typing import Protocol

# CloudScoutConfig, ItemRef, ItemMetadata and ItemContent are project types
# whose definitions are not shown in this spec.

class SourceConnector(Protocol):
    source_type: str  # "gmail"

    async def list_new(self, scout: CloudScoutConfig) -> list[ItemRef]: ...
    async def fetch_metadata(self, scout: CloudScoutConfig, ref: ItemRef) -> ItemMetadata: ...
    async def fetch_content(self, scout: CloudScoutConfig, ref: ItemRef) -> ItemContent: ...
    async def archive(self, scout: CloudScoutConfig, ref: ItemRef) -> None: ...  # used by spam auto-trash
    async def setup_watch(self, scout: CloudScoutConfig) -> None: ...
    async def renew_watch(self, scout: CloudScoutConfig) -> None: ...

ItemContent.body_text is in-memory only; never persisted.
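
To make the "implement one connector class + register it" step concrete, here is a minimal sketch of an in-memory connector and a registry. It is not the planned implementation: `ContentSource` is a trimmed-down stand-in for `SourceConnector` (two of its six methods, keyed by a plain `scout_id` string instead of `CloudScoutConfig`), and `FakeMailConnector`, `CONNECTORS`, `register` and `get_connector` are hypothetical names introduced only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ItemRef:
    source_msg_ref: str


@dataclass(frozen=True)
class ItemContent:
    subject: str
    sender: str
    body_text: str  # in-memory only; never persisted


class ContentSource(Protocol):
    """Trimmed-down stand-in for SourceConnector (assumption: 2 of 6 methods)."""

    source_type: str

    async def list_new(self, scout_id: str) -> list[ItemRef]: ...
    async def fetch_content(self, scout_id: str, ref: ItemRef) -> ItemContent: ...


class FakeMailConnector:
    """Hypothetical in-memory connector, e.g. for ScoutEngine unit tests."""

    source_type = "fake-mail"

    def __init__(self, inbox: dict[str, ItemContent]) -> None:
        self._inbox = inbox

    async def list_new(self, scout_id: str) -> list[ItemRef]:
        # Every message in the fake inbox counts as new.
        return [ItemRef(ref) for ref in self._inbox]

    async def fetch_content(self, scout_id: str, ref: ItemRef) -> ItemContent:
        return self._inbox[ref.source_msg_ref]


# Hypothetical registry: source_type -> connector instance.
CONNECTORS: dict[str, ContentSource] = {}


def register(connector: ContentSource) -> None:
    CONNECTORS[connector.source_type] = connector


def get_connector(source_type: str) -> ContentSource:
    return CONNECTORS[source_type]
```

Because the protocol is structural, `FakeMailConnector` does not inherit from anything; the engine would only ever look a connector up by `source_type`.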

ScoutEngine

class ScoutEngine:
    async def trigger_scout(self, scout_id: UUID) -> None: ...
    async def _process_item(self, scout, connector, ref) -> None: ...
    async def deliver_pending(self, user_id: UUID, ws: DeviceWS) -> None: ...

Both webhook and cron-fallback entry points call trigger_scout.

Data Model

Postgres (BE)

Renames (Phase 1, single Alembic migration)

| Kind | Before | After |
|---|---|---|
| Table | local_agent_configs | local_scout_configs |
| Table | cloud_agent_configs | cloud_scout_configs |
| Table | agent_run_logs | scout_run_logs |
| Column | agent_config | scout_config |
| Column | agent_id (FKs) | scout_id |
| Column | agent_run_id | scout_run_id |
| Class | LocalAgentConfig | LocalScoutConfig |
| Class | CloudAgentConfig | CloudScoutConfig |
| Class | AgentRunLog | ScoutRunLog |

New (Phase 2)

CREATE TABLE scout_triage_queue (
  id              uuid PRIMARY KEY,
  user_id         uuid NOT NULL REFERENCES users(id),
  scout_id        uuid NOT NULL REFERENCES cloud_scout_configs(id),
  source_type     text NOT NULL,                   -- "gmail"
  source_msg_ref  text NOT NULL,                   -- gmail message id
  triage_verdict  text NOT NULL,                   -- "relevant"
  triage_reason   text,                            -- short LLM reason for debug
  status          text NOT NULL DEFAULT 'queued',  -- queued | delivered | acked | expired
  triaged_at      timestamptz NOT NULL DEFAULT now(),
  delivered_at    timestamptz,
  acked_at        timestamptz,
  expires_at      timestamptz NOT NULL,            -- triaged_at + 30d
  UNIQUE (scout_id, source_msg_ref)                -- idempotent webhook retries
);
CREATE INDEX ON scout_triage_queue (user_id, status);
CREATE INDEX ON scout_triage_queue (expires_at) WHERE status != 'acked';
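
The `UNIQUE (scout_id, source_msg_ref)` constraint is what makes webhook retries harmless. The sketch below illustrates the idea with an in-memory SQLite table that mirrors only a subset of the columns; it is an assumption-laden stand-in, not the Postgres code path. `make_db`, `enqueue` and `QUEUE_TTL` are hypothetical names, and the `ON CONFLICT ... DO NOTHING` clause is used here because SQLite and Postgres both accept it.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone
from typing import Optional

QUEUE_TTL = timedelta(days=30)  # mirrors "expires_at = triaged_at + 30d"


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for scout_triage_queue (subset of columns)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE scout_triage_queue (
            id             TEXT PRIMARY KEY,
            scout_id       TEXT NOT NULL,
            source_msg_ref TEXT NOT NULL,
            triage_verdict TEXT NOT NULL,
            status         TEXT NOT NULL DEFAULT 'queued',
            triaged_at     TEXT NOT NULL,
            expires_at     TEXT NOT NULL,
            UNIQUE (scout_id, source_msg_ref)
        )
        """
    )
    return db


def enqueue(
    db: sqlite3.Connection,
    scout_id: str,
    source_msg_ref: str,
    now: Optional[datetime] = None,
) -> bool:
    """Insert a 'relevant' row; return False if the message was already queued."""
    now = now or datetime.now(timezone.utc)
    before = db.total_changes
    db.execute(
        """
        INSERT INTO scout_triage_queue
            (id, scout_id, source_msg_ref, triage_verdict, triaged_at, expires_at)
        VALUES (?, ?, ?, 'relevant', ?, ?)
        ON CONFLICT (scout_id, source_msg_ref) DO NOTHING
        """,
        (
            str(uuid.uuid4()),
            scout_id,
            source_msg_ref,
            now.isoformat(),
            (now + QUEUE_TTL).isoformat(),
        ),
    )
    # total_changes only grows when a row was actually written.
    return db.total_changes > before
```

Replaying the same `(scout_id, source_msg_ref)` pair leaves the table unchanged, while the same message id under a different scout is still accepted.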

Alterations to cloud_scout_configs (Phase 2)

ALTER TABLE cloud_scout_configs ADD COLUMN auto_trash_spam boolean NOT NULL DEFAULT false;
ALTER TABLE cloud_scout_configs ADD COLUMN gmail_history_id text;
ALTER TABLE cloud_scout_configs ADD COLUMN gmail_watch_expires_at timestamptz;
ALTER TABLE cloud_scout_configs ADD COLUMN device_inactivity_pause_days int NOT NULL DEFAULT 14;

OAuth tokens continue to live in the existing cloud_scout_configs.oauth_token_encrypted column. Encryption mechanism (key derivation, rotation) is reused unchanged. A pre-implementation investigation step will document the current key-management story so we know the threat model; hardening, if needed, is out of scope.

SQLite (Electron, Drizzle)

Renames (Phase 1)

| Kind | Before | After |
|---|---|---|
| Table | agent_runs | scout_runs |
| Table | agent_run_actions | scout_run_actions |
| Col | agent_id | scout_id |

New (Phase 2)

export const scoutSuggestions = sqliteTable('scout_suggestions', {
  id:                  text().primaryKey(),
  scoutId:             text().notNull(),
  sourceType:          text().notNull(),     // "gmail"
  sourceMsgRef:        text().notNull(),
  category:            text().notNull(),     // "unprocessed" until Phase 4
  payload:             text(),               // JSON, populated by Phase 4
  rawSubject:          text(),               // populated on delivery
  rawSnippet:          text(),               // populated on delivery
  status:              text().notNull(),     // pending | approved | rejected | expired
  proposedAt:          integer().notNull(),  // ms epoch
  resolvedAt:          integer(),
  resolvedEntityType:  text(),               // "task" | "project" | ... after Phase 4 approval
  resolvedEntityId:    text(),
});

rawSubject + rawSnippet are stored locally to render the HITL card without re-hitting Gmail every render. Body is still NOT stored — fetched on-demand via a tool call when the user explicitly opens the suggestion.

WebSocket Frame Contract

Existing /api/v1/device channel. Two new frame types.

// BE → Electron
{
  type: 'scout_proposal',
  proposal: {
    id: string,
    scoutId: string,
    sourceType: 'gmail',
    sourceMsgRef: string,
    rawSubject: string | null,
    rawSnippet: string | null,
    category: 'unprocessed',
    payload: null
  }
}

// Electron → BE
{ type: 'scout_proposal_ack', proposalId: string }

On WS reconnect, BE's ScoutEngine.deliver_pending(user_id, ws) selects all status='queued' rows for the user, calls connector.fetch_metadata per row (subject + snippet only), and sends one scout_proposal frame per row. When the matching scout_proposal_ack arrives, it flips that row to status='delivered' and sets delivered_at.
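
The deliver-on-reconnect flow can be sketched roughly as follows. Everything here is illustrative: `QueueRow` is an in-memory stand-in for a `scout_triage_queue` row, `FakeDeviceWS.send_and_wait_ack` is a hypothetical helper that acks every frame immediately, and `fetch_metadata` is a placeholder for `connector.fetch_metadata`. Only the frame shapes are taken from the contract above.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class QueueRow:
    id: str
    user_id: str
    scout_id: str
    source_msg_ref: str
    status: str = "queued"
    delivered_at: Optional[datetime] = None


class FakeDeviceWS:
    """Hypothetical WS stand-in that acks every frame it is sent."""

    def __init__(self) -> None:
        self.sent: list[dict] = []

    async def send_and_wait_ack(self, frame: dict) -> dict:
        self.sent.append(frame)
        return {"type": "scout_proposal_ack", "proposalId": frame["proposal"]["id"]}


async def fetch_metadata(row: QueueRow) -> tuple[str, str]:
    """Placeholder for connector.fetch_metadata: subject + snippet only."""
    return (f"subject of {row.source_msg_ref}", "snippet")


async def deliver_pending(rows: list[QueueRow], user_id: str, ws: FakeDeviceWS) -> int:
    """Send one scout_proposal per queued row; mark delivered once acked."""
    delivered = 0
    for row in rows:
        if row.user_id != user_id or row.status != "queued":
            continue
        subject, snippet = await fetch_metadata(row)
        frame = {
            "type": "scout_proposal",
            "proposal": {
                "id": row.id,
                "scoutId": row.scout_id,
                "sourceType": "gmail",
                "sourceMsgRef": row.source_msg_ref,
                "rawSubject": subject,
                "rawSnippet": snippet,
                "category": "unprocessed",
                "payload": None,
            },
        }
        ack = await ws.send_and_wait_ack(frame)
        if ack.get("type") == "scout_proposal_ack" and ack.get("proposalId") == row.id:
            row.status = "delivered"
            row.delivered_at = datetime.now(timezone.utc)
            delivered += 1
    return delivered
```

A row that is never acked stays `queued`, so it is simply picked up again on the next reconnect.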

Stage 1 Triage Detail

Webhook (Pub/Sub) or cron tick
  -> ScoutEngine.trigger_scout(scout_id)
  -> if device inactive > N days: skip (pause)
  -> connector.list_new(scout) -> [ItemRef]
  -> for each ref:
       - if (scout_id, source_msg_ref) already in queue: skip (idempotent)
       - content = await connector.fetch_content(scout, ref)         # transient
       - verdict = await ScoutEngine._triage_llm(scout, content)     # gpt-4o-mini
       - if verdict == spam:
           - if scout.auto_trash_spam: connector.archive(...)
           - continue                                                # not queued; move on to next ref
       - INSERT scout_triage_queue row
  -> UPDATE cloud_scout_configs.last_run_at
  -> INSERT scout_run_logs row
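
As a rough, runnable illustration of the flow above, the sketch below runs the per-item loop against in-memory stand-ins. It is written under several assumptions: `Scout` carries only the two settings the loop needs, the queue is a plain dict keyed by `(scout_id, source_msg_ref)`, and `fetch_content`, `triage` and `archive` are caller-supplied coroutines standing in for the connector and the LLM call. The `last_run_at` update and the run-log insert are left out.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Awaitable, Callable, Optional


@dataclass
class Scout:
    id: str
    auto_trash_spam: bool = False
    device_inactivity_pause_days: int = 14


async def run_triage(
    scout: Scout,
    device_last_seen: datetime,
    refs: list[str],
    fetch_content: Callable[[str], Awaitable[str]],
    triage: Callable[[str], Awaitable[str]],
    archive: Callable[[str], Awaitable[None]],
    queue: dict[tuple[str, str], str],
    now: Optional[datetime] = None,
) -> int:
    """Triage each ref once; return how many rows were queued."""
    now = now or datetime.now(timezone.utc)
    pause_after = timedelta(days=scout.device_inactivity_pause_days)
    if now - device_last_seen > pause_after:
        return 0  # device inactive too long: pause, fetch nothing

    queued = 0
    for ref in refs:
        key = (scout.id, ref)
        if key in queue:
            continue  # idempotent: already triaged as relevant
        body = await fetch_content(ref)  # transient, never stored
        verdict = await triage(body)
        if verdict == "spam":
            if scout.auto_trash_spam:
                await archive(ref)
            continue  # spam is not queued
        queue[key] = verdict  # only the verdict is kept, not the body
        queued += 1
    return queued
```

Note that only the verdict ever reaches `queue`; the body lives in a local variable for the duration of one loop iteration.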

Triage LLM contract

  • Prompt name (Langfuse): scout-triage-system — source-agnostic, parameterized by source_type.
  • Input: {source_type, scout_name, scout_purpose, item_subject, item_sender, item_body_truncated_2k}.
  • Output (structured, Pydantic TriageVerdict): {verdict: "relevant" | "spam", reason: str, confidence: float}.
  • Cost guard: body truncated at 2k chars before LLM call.
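
The contract above could be exercised with something like the following sketch. It deliberately uses stdlib dataclasses in place of the Pydantic `TriageVerdict` model to stay dependency-free, so it is not the planned model definition. `build_triage_input`, `parse_verdict` and `BODY_LIMIT` are hypothetical names, and the 0.0 to 1.0 range check on `confidence` is an assumption the spec does not state.

```python
import json
from dataclasses import dataclass

BODY_LIMIT = 2000  # "2k chars" cost guard from the contract

ALLOWED_VERDICTS = ("relevant", "spam")


@dataclass(frozen=True)
class TriageVerdict:
    verdict: str  # "relevant" | "spam"
    reason: str
    confidence: float


def build_triage_input(
    source_type: str,
    scout_name: str,
    scout_purpose: str,
    subject: str,
    sender: str,
    body: str,
) -> dict:
    """Assemble the prompt variables, truncating the body before the LLM call."""
    return {
        "source_type": source_type,
        "scout_name": scout_name,
        "scout_purpose": scout_purpose,
        "item_subject": subject,
        "item_sender": sender,
        "item_body_truncated_2k": body[:BODY_LIMIT],
    }


def parse_verdict(raw: str) -> TriageVerdict:
    """Validate the LLM's JSON output against the verdict contract."""
    data = json.loads(raw)
    verdict = data["verdict"]
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:  # assumed range, not stated in the spec
        raise ValueError(f"confidence out of range: {confidence}")
    return TriageVerdict(verdict=verdict, reason=str(data["reason"]), confidence=confidence)
```

Rejecting anything other than the two known verdicts keeps a malformed LLM response on the "LLM call fails" path rather than silently queueing it.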

Failure modes

  • LLM call fails: log error, leave message unprocessed, retry on next webhook/cron.
  • Gmail 401 (refresh exhausted): mark scout enabled=false, surface re-auth prompt to user via WS frame on next device connect.
  • Pub/Sub webhook request with a missing or unverifiable JWT: respond 401 and do not enqueue a triage job.

Gmail Push Setup

  • On scout enable: GmailConnector.setup_watch(scout) calls users.watch against a single project-wide Pub/Sub topic.
  • gmail_watch_expires_at stored. Watches expire after 7 days.
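  • Cron _scout_watch_renewal_tick re-issues watch for any scout whose expiry is within 24h. With a 24h renewal window the tick has to run at least daily; a strictly weekly tick could skip a watch that expires a few days later and let it lapse.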
  • Webhook route: POST /api/v1/scouts/webhooks/gmail. Verifies Pub/Sub-signed JWT, resolves user via the email address in the payload, enqueues triage job.
  • Cron fallback (_scout_cron_tick, runs each scout's schedule_cron): polls users.history.list since gmail_history_id, updates gmail_history_id after.
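
Two small pieces of the push path can be sketched in isolation: decoding the notification and deciding whether a watch is due for renewal. A Pub/Sub push request carries the Gmail notification as base64url-encoded JSON in `message.data`, with `emailAddress` and `historyId` fields. The function names `decode_gmail_push` and `watch_needs_renewal` are hypothetical, and JWT verification is intentionally not shown here.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def decode_gmail_push(envelope: dict) -> tuple[str, str]:
    """Extract (email_address, history_id) from a Pub/Sub push envelope."""
    data = envelope["message"]["data"]
    # Restore any stripped base64 padding before decoding.
    padded = data + "=" * (-len(data) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return payload["emailAddress"], str(payload["historyId"])


def watch_needs_renewal(
    expires_at: datetime,
    now: datetime,
    window: timedelta = timedelta(hours=24),
) -> bool:
    """True when the watch expires within the window (or already has)."""
    return expires_at - now <= window
```

The email address resolves the user and scout; the history id is the cursor that `users.history.list` would be called with before `gmail_history_id` is updated.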

Terminology Refactor (Detail)

Renamed

| Surface | Before | After |
|---|---|---|
| Settings nav | settings.agents "Agents" | settings.scouts "Scouts" |
| Subtitle/desc | settings.agentsSubtitle, agentsDescription | settings.scoutsSubtitle, scoutsDescription |
| agents.* keys | noAgentsYet, createAgent, yourAgents, etc. | scouts.noScoutsYet, createScout, yourScouts |
| toast.agent.* | created, runStarted, etc. | toast.scout.* |
| Components | AgentsSection, AgentRow, LocalAgentConfigPanel, CloudAgentConfigPanel, InlineAgentCreationStepper | ScoutsSection, ScoutRow, LocalScoutConfigPanel, CloudScoutConfigPanel, InlineScoutCreationStepper |
| TS types | LocalAgentConfig, CloudAgentConfig | LocalScoutConfig, CloudScoutConfig |
| tRPC router | agent.local, agent.cloud, agent.journey, agent.runs, agent.runActions | scout.local, scout.cloud, scout.journey, scout.runs, scout.runActions |
| Drizzle tables | agent_runs, agent_run_actions | scout_runs, scout_run_actions |
| Main process | src/main/agents/agent-scheduler.ts | src/main/scouts/scout-scheduler.ts |
| BE routes | /api/v1/agents/*, /api/v1/agent-setup | /api/v1/scouts/*, /api/v1/scout-setup |
| BE modules | routes/agents.py, routes/agent_setup.py, core/agent_runner.py, core/agent_session_buffer.py, core/agent_registry.py | routes/scouts.py, routes/scout_setup.py, core/scout_runner.py, core/scout_session_buffer.py, core/scout_registry.py |
| Postgres tables | local_agent_configs, cloud_agent_configs, agent_run_logs | local_scout_configs, cloud_scout_configs, scout_run_logs |
| Postgres columns | agent_config, agent_id, agent_run_id | scout_config, scout_id, scout_run_id |
| SQLAlchemy models | LocalAgentConfig, CloudAgentConfig, AgentRunLog | LocalScoutConfig, CloudScoutConfig, ScoutRunLog |
| Langfuse prompts | user-facing scout prompts named agent-* | recreate as scout-*; delete old |
| i18n | 5 langs (en/it/es/fr/de) | all updated atomically |

Kept as-is

  • app/agents/* Python module — these are LLM helper agents (task_agent, project_agent, note_agent, timeline_agent, filesystem_agent) invoked internally by deep_agent. Different concept from user-facing scouts. Renaming would create semantic clash with LLM-agent terminology.
  • /api/v1/device WS endpoint name (already source-neutral).
  • All tool_call, run_complete, etc. WS frame types unrelated to scouts.

Phasing

Phase 1 — Rename only

  • Single PR. Single Alembic migration. Single Drizzle migration.
  • All renames listed above land together. App still works, existing local scout still runs. No new behavior.

Phase 2 — Connector abstraction skeleton

  • New module app/scouts/connectors/{base,registry,gmail}.py.
  • New module app/scouts/engine.py.
  • New table scout_triage_queue + alterations to cloud_scout_configs.
  • New SQLite table scout_suggestions (Drizzle).
  • New WS frame types scout_proposal + scout_proposal_ack.
  • No user-facing change yet.

Phase 3 — Gmail scout end-to-end

  • Settings UI: "Add Gmail scout" → OAuth consent (separate scope set: gmail.readonly + gmail.modify) → encrypted token stored in cloud_scout_configs.oauth_token_encrypted → save scout config.
  • Pub/Sub topic + webhook route + JWT verify.
  • setup_watch on enable; renew_watch cron (run at least daily, given the 24h renewal window).
  • Cron-fallback _scout_cron_tick per scout.
  • Triage LLM (gpt-4o-mini, Langfuse scout-triage-system).
  • Spam auto-trash toggle (default off) per scout.
  • Device-inactivity pause logic.
  • WS deliver-on-reconnect drains queue → scout_proposal frames → ack handler → SQLite scout_suggestions insert with category='unprocessed' (Phase 4 swaps real categorization in).
  • "Read full email" tool call: Electron requests body for a suggestion → BE GmailConnector.fetch_content → returns body transiently in tool result.

Phase 4 — Deferred (separate spec, with task-brief rework)

  • Stage 2 categorization agent (prompt + tool palette: list_projects, list_tasks, search_notes, memory).
  • HITL UI surface in the brief: suggestion cards, approve/reject controls, "convert to task | event | note | project | actionable-only" actions.
  • list_pending_scout_suggestions brief tool.
  • Convert-to-entity mutations.
  • Future connectors (Slack/Teams/Outlook/...).

Testing Surface

  • Phase 1: existing pytest suite still green with renamed identifiers (auth, ws_unified, schemas, models, etc.). UI smoke: settings page renders, existing local scout runs.
  • Phase 2: unit tests for ScoutEngine w/ mocked SourceConnector. Idempotency test (replay same source_msg_ref).
  • Phase 3: integration tests for Gmail webhook → triage → queue insertion, with GmailConnector content fetch and the triage LLM both mocked. E2E (manual): connect a real Gmail account on dev, send an email, observe queue row appear, reconnect device, observe scout_suggestions row land with subject/snippet.

Open Questions (none blocking)

  • OAuth-token encryption key derivation (app-global vs user-derived) — investigation step in implementation plan; document current state, security hardening is out of scope.
  • Pub/Sub topic naming and IAM setup (one topic project-wide vs per-environment) — operational detail to decide during Phase 3.

Risks

  • Pub/Sub setup is per-Google-Cloud-project and requires console IAM grants — first-time setup friction.
  • Gmail users.watch quota: 1 watch per user. We use one watch per scout, but a user has only one Gmail scout per Gmail account, so this is fine.
  • _pending_states dict pattern in existing OAuth flow is in-memory — Pub/Sub webhook can run on any worker, so any cross-request state must be in DB, not in-memory. This design uses no in-memory state; safe.

Acceptance

  • All renames land atomically; app boots; existing local scout still operates.
  • A user can connect Gmail through the Scouts settings page, see the scout marked enabled, send themselves a test email, and observe a scout_suggestions row appear in their local DB with category='unprocessed', rawSubject, and rawSnippet populated, after the next WS reconnect.
  • Spam emails (per LLM triage) are not queued; if auto_trash_spam=true they appear in Gmail Trash.
  • BE never persists email bodies. Verified by code review of triage flow + grep for body_text writes.