workspace/docs/superpowers/specs/2026-05-15-scouts-refactor-and-gmail-integration-design.md
Roberto 732a4c42f8 docs: add scouts refactor + gmail scout design spec
Phases 1-3 in scope: rename agents → scouts (UI/code/Postgres/SQLite/
Langfuse), Gmail cloud scout w/ two-stage pipeline, SourceConnector
abstraction. Phase 4 (Stage 2 categorization + HITL surface in brief)
deferred to task-brief rework.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 23:15:46 +02:00


# Scouts Refactor + Gmail Integration — Design
**Date:** 2026-05-15
**Status:** Draft, awaiting user review
**Owner:** Roberto
## Summary
Rename the existing "Agents" subsystem to "Scouts" across the entire stack (UI, code, Postgres, SQLite, Langfuse), then add the first cloud scout — Gmail — using a two-stage pipeline that respects zero-trust (no email content stored on backend) and human-in-the-loop (no entities created autonomously).
The implementation is split into four phases. Phases 1-3 ship now. Phase 4 (Stage 2 categorization, HITL surface in the brief, conversion-to-entity mutations) is deferred to the planned task-brief rework.
## Goals
- Unify the user-facing "data source watchers" concept under one name: **Scout**.
- Land a `SourceConnector` abstraction so future cloud scouts (Slack/Teams/Outlook/RSS/...) reuse the same engine, queue, delivery channel, and HITL surface — only the per-source connector is new.
- Ship a Gmail scout end-to-end with: OAuth, push (`users.watch`) + cron-fallback polling, BE-side spam triage, encrypted token storage, opt-in spam auto-trash.
- Preserve zero-trust: Gmail bodies are fetched transiently for the triage LLM call and discarded; only `{message_id, scout_id, verdict, status}` is persisted on BE.
- Preserve HITL on the cloud path: scouts never create tasks/projects/events/notes autonomously; they accumulate proposals that the user resolves later from the brief.
## Non-Goals (Phase 4, separate spec)
- Stage 2 categorization agent prompt + tool palette.
- HITL UI in the task brief (suggestion cards, approve/reject controls, convert-to-entity mutations, `list_pending_scout_suggestions` brief tool).
- Local scout behavior change. Local directory monitor keeps current "auto-create" semantics. HITL is opt-in for local scouts in a future migration.
- Schema unification of `LocalScoutConfig` + `CloudScoutConfig`. They have different behaviors; keep separate tables.
- Connectors other than Gmail (Slack/Teams/Outlook).
- Stripe/billing changes (existing tier checks suffice).
## Constraints
- **Pre-1.0 dev**: no production users, no backwards-compatibility shims, no Alembic data migrations beyond rename. Drop-and-recreate is acceptable where simpler.
- **Zero-trust**: BE never persists user content. Gmail bodies are read transiently for the triage LLM call only.
- **HITL (cloud path)**: scouts produce proposals, never entities.
- **Spam auto-trash**: off by default per scout; opt-in via UI toggle. Action is "move to Trash" (recoverable within Gmail's 30-day Trash retention window), never permanent delete.
- **Reusability**: cloud-scout pipeline (connector → triage → queue → deliver-on-connect → HITL) is shared infra; Gmail is just the first connector.
## Architecture
### Two-stage pipeline (cloud scouts only)
```
[Gmail] --push/cron--> [BE Stage 1: Triage]                 [Electron Stage 2: Categorize]
                                |                                         |
                                v                                         v
                      fetch body (transient)                 drain queue on WS reconnect
                                |                                         |
                                v                                         v
                        LLM relevance call                   fetch metadata for each msg
                                |                                         |
                                +-- spam + auto_trash_spam: archive       v
                                |                           insert scout_suggestions row
                                +-- relevant: insert queue row  (category='unprocessed' stub
                                                                 until Phase 4)
```
**Stage 1 (BE, always-on):** verdict only. Stores `{msg_id, verdict, status}`. No content.
**Stage 2 (Electron, on connect):** Phase 3 ships a stub that simply mirrors the queue into a local SQLite table with `category='unprocessed'`. Phase 4 swaps in the real categorization agent.
### Local scouts (unchanged behaviorally)
Local directory monitor keeps current Electron-side scheduling and auto-creation. Only renames apply.
### SourceConnector abstraction
A `SourceConnector` Protocol owns all source-specific I/O. The shared `ScoutEngine` owns triage, queueing, delivery, and ack handling. To add a new cloud scout: implement one connector class + register it.
```python
# app/scouts/connectors/base.py
from typing import Protocol


class SourceConnector(Protocol):
    source_type: str  # "gmail"

    async def list_new(self, scout: CloudScoutConfig) -> list[ItemRef]: ...
    async def fetch_metadata(self, scout: CloudScoutConfig, ref: ItemRef) -> ItemMetadata: ...
    async def fetch_content(self, scout: CloudScoutConfig, ref: ItemRef) -> ItemContent: ...
    async def archive(self, scout: CloudScoutConfig, ref: ItemRef) -> None: ...
    async def setup_watch(self, scout: CloudScoutConfig) -> None: ...
    async def renew_watch(self, scout: CloudScoutConfig) -> None: ...
```
`ItemContent.body_text` is in-memory only; never persisted.
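To make the "implement one connector class + register it" step concrete, here is a minimal sketch. It uses a reduced slice of the protocol and a dict-backed test double. `ContentConnector`, `InMemoryConnector`, `register`, `get_connector`, and the simplified `ItemRef`/`ItemContent` dataclasses are hypothetical names for illustration only; the real registry module in `app/scouts/connectors/registry.py` may look different.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical stand-ins for the spec's ItemRef / ItemContent types.
@dataclass(frozen=True)
class ItemRef:
    source_msg_ref: str


@dataclass(frozen=True)
class ItemContent:
    subject: str
    body_text: str  # in-memory only; never persisted


class ContentConnector(Protocol):
    """Reduced slice of SourceConnector, enough to show registration."""

    source_type: str

    async def list_new(self, scout_id: str) -> list[ItemRef]: ...
    async def fetch_content(self, scout_id: str, ref: ItemRef) -> ItemContent: ...


class InMemoryConnector:
    """Test double that serves messages from a dict instead of a real API."""

    source_type = "fake"

    def __init__(self, inbox: dict[str, ItemContent]) -> None:
        self._inbox = inbox

    async def list_new(self, scout_id: str) -> list[ItemRef]:
        return [ItemRef(ref) for ref in self._inbox]

    async def fetch_content(self, scout_id: str, ref: ItemRef) -> ItemContent:
        return self._inbox[ref.source_msg_ref]


# Hypothetical registry: source_type -> connector instance.
_REGISTRY: dict[str, ContentConnector] = {}


def register(connector: ContentConnector) -> None:
    """Add a connector; refuse to silently replace an existing one."""
    if connector.source_type in _REGISTRY:
        raise ValueError(f"duplicate connector: {connector.source_type}")
    _REGISTRY[connector.source_type] = connector


def get_connector(source_type: str) -> ContentConnector:
    return _REGISTRY[source_type]
```

Because `Protocol` is structural, `InMemoryConnector` does not need to inherit from anything; matching the method signatures is enough for a type checker to accept it.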
### ScoutEngine
```python
class ScoutEngine:
    async def trigger_scout(self, scout_id: UUID) -> None: ...
    async def _process_item(self, scout, connector, ref) -> None: ...
    async def deliver_pending(self, user_id: UUID, ws: DeviceWS) -> None: ...
```
Both webhook and cron-fallback entry points call `trigger_scout`.
## Data Model
### Postgres (BE)
#### Renames (Phase 1, single Alembic migration)
| Before | After |
|--------------------------------|--------------------------------|
| Table `local_agent_configs` | `local_scout_configs` |
| Table `cloud_agent_configs` | `cloud_scout_configs` |
| Table `agent_run_logs` | `scout_run_logs` |
| Column `agent_config` | `scout_config` |
| Column `agent_id` (FKs) | `scout_id` |
| Column `agent_run_id` | `scout_run_id` |
| Class `LocalAgentConfig` | `LocalScoutConfig` |
| Class `CloudAgentConfig` | `CloudScoutConfig` |
| Class `AgentRunLog` | `ScoutRunLog` |
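As an illustration of the "land together" intent behind the single migration, the sketch below applies the three table renames inside one transaction. It uses an in-memory SQLite database as a stand-in; the real migration targets Postgres through Alembic, and `rename_statements` and `apply_renames` are hypothetical helper names. Column renames are left out because the table above does not say which tables own them.

```python
import sqlite3

# Table renames copied from the table above.
TABLE_RENAMES = {
    "local_agent_configs": "local_scout_configs",
    "cloud_agent_configs": "cloud_scout_configs",
    "agent_run_logs": "scout_run_logs",
}


def rename_statements(renames: dict[str, str]) -> list[str]:
    """Build one ALTER TABLE ... RENAME TO statement per table."""
    return [f'ALTER TABLE "{old}" RENAME TO "{new}"' for old, new in renames.items()]


def apply_renames(conn: sqlite3.Connection, renames: dict[str, str]) -> None:
    """Apply every rename, or none of them if any statement fails."""
    conn.execute("BEGIN")  # explicit: DDL does not open a transaction implicitly
    try:
        for stmt in rename_statements(renames):
            conn.execute(stmt)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

If any table is missing, the rollback leaves the tables that were already renamed under their old names, so a failed run can simply be retried.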
#### New (Phase 2)
```sql
CREATE TABLE scout_triage_queue (
    id              uuid PRIMARY KEY,
    user_id         uuid NOT NULL REFERENCES users(id),
    scout_id        uuid NOT NULL REFERENCES cloud_scout_configs(id),
    source_type     text NOT NULL,                   -- "gmail"
    source_msg_ref  text NOT NULL,                   -- gmail message id
    triage_verdict  text NOT NULL,                   -- "relevant"
    triage_reason   text,                            -- short LLM reason for debug
    status          text NOT NULL DEFAULT 'queued',  -- queued | delivered | acked | expired
    triaged_at      timestamptz NOT NULL DEFAULT now(),
    delivered_at    timestamptz,
    acked_at        timestamptz,
    expires_at      timestamptz NOT NULL,            -- triaged_at + 30d
    UNIQUE (scout_id, source_msg_ref)                -- idempotent webhook retries
);
CREATE INDEX ON scout_triage_queue (user_id, status);
CREATE INDEX ON scout_triage_queue (expires_at) WHERE status != 'acked';
```
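The `UNIQUE (scout_id, source_msg_ref)` constraint is what makes webhook retries harmless. The sketch below shows one way an insert could rely on it. SQLite stands in for Postgres here, the table is a reduced copy of the DDL above, and `enqueue` and `QUEUE_TTL` are hypothetical names.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone

QUEUE_TTL = timedelta(days=30)  # matches "triaged_at + 30d" in the DDL comment

# Reduced copy of scout_triage_queue; SQLite stands in for Postgres.
SCHEMA = """
CREATE TABLE scout_triage_queue (
    id             TEXT PRIMARY KEY,
    scout_id       TEXT NOT NULL,
    source_msg_ref TEXT NOT NULL,
    status         TEXT NOT NULL DEFAULT 'queued',
    triaged_at     TEXT NOT NULL,
    expires_at     TEXT NOT NULL,
    UNIQUE (scout_id, source_msg_ref)
)
"""


def enqueue(conn: sqlite3.Connection, scout_id: str, source_msg_ref: str, now=None) -> bool:
    """Insert a queue row; return False if this message was already queued."""
    now = now or datetime.now(timezone.utc)
    cur = conn.execute(
        "INSERT INTO scout_triage_queue "
        "(id, scout_id, source_msg_ref, triaged_at, expires_at) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT (scout_id, source_msg_ref) DO NOTHING",
        (
            str(uuid.uuid4()),
            scout_id,
            source_msg_ref,
            now.isoformat(),
            (now + QUEUE_TTL).isoformat(),
        ),
    )
    # rowcount is 0 when the conflict clause suppressed the insert.
    return cur.rowcount == 1
```

A replayed webhook for the same message then becomes a no-op at the database layer, independent of any earlier in-application check.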
#### Alterations to `cloud_scout_configs` (Phase 2)
```sql
ALTER TABLE cloud_scout_configs ADD COLUMN auto_trash_spam boolean NOT NULL DEFAULT false;
ALTER TABLE cloud_scout_configs ADD COLUMN gmail_history_id text;
ALTER TABLE cloud_scout_configs ADD COLUMN gmail_watch_expires_at timestamptz;
ALTER TABLE cloud_scout_configs ADD COLUMN device_inactivity_pause_days int NOT NULL DEFAULT 14;
```
OAuth tokens continue to live in the existing `cloud_scout_configs.oauth_token_encrypted` column. Encryption mechanism (key derivation, rotation) is reused unchanged. A pre-implementation investigation step will document the current key-management story so we know the threat model; hardening, if needed, is out of scope.
### SQLite (Electron, Drizzle)
#### Renames (Phase 1)
| Before | After |
|---------------------|---------------------|
| `agent_runs` | `scout_runs` |
| `agent_run_actions` | `scout_run_actions` |
| Col `agent_id` | `scout_id` |
#### New (Phase 2)
```typescript
export const scoutSuggestions = sqliteTable('scout_suggestions', {
  id: text().primaryKey(),
  scoutId: text().notNull(),
  sourceType: text().notNull(),     // "gmail"
  sourceMsgRef: text().notNull(),
  category: text().notNull(),       // "unprocessed" until Phase 4
  payload: text(),                  // JSON, populated by Phase 4
  rawSubject: text(),               // populated on delivery
  rawSnippet: text(),               // populated on delivery
  status: text().notNull(),         // pending | approved | rejected | expired
  proposedAt: integer().notNull(),  // ms epoch
  resolvedAt: integer(),
  resolvedEntityType: text(),       // "task" | "project" | ... after Phase 4 approval
  resolvedEntityId: text(),
});
```
`rawSubject` + `rawSnippet` are stored locally to render the HITL card without re-hitting Gmail every render. Body is still NOT stored — fetched on-demand via a tool call when the user explicitly opens the suggestion.
## WebSocket Frame Contract
Existing `/api/v1/device` channel. Two new frame types.
```typescript
// BE → Electron
{
  type: 'scout_proposal',
  proposal: {
    id: string,
    scoutId: string,
    sourceType: 'gmail',
    sourceMsgRef: string,
    rawSubject: string | null,
    rawSnippet: string | null,
    category: 'unprocessed',
    payload: null
  }
}

// Electron → BE
{ type: 'scout_proposal_ack', proposalId: string }
```
On WS reconnect, BE's `ScoutEngine.deliver_pending(user_id, ws)` selects all `status='queued'` rows for the user, calls `connector.fetch_metadata` per row (subject + snippet only), sends one `scout_proposal` frame each, and flips `status='delivered'` + sets `delivered_at` upon ack.
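The reconnect flow can be sketched as below, reading the paragraph literally: rows stay `queued` until the ack arrives, and only then flip to `delivered`. This is an assumption about intent, not a confirmed implementation. `QueueRow`, `FakeWS`, `fetch_metadata`, and `handle_ack` are hypothetical names, and in-memory lists replace the database.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class QueueRow:
    id: str
    scout_id: str
    source_msg_ref: str
    status: str = "queued"
    delivered_at: Optional[datetime] = None


@dataclass
class FakeWS:
    """Hypothetical socket double that records outgoing frames."""

    sent: list = field(default_factory=list)

    async def send_json(self, frame: dict) -> None:
        self.sent.append(frame)


async def fetch_metadata(row: QueueRow) -> dict:
    # Stand-in for connector.fetch_metadata: subject + snippet only, no body.
    return {"subject": f"subject of {row.source_msg_ref}", "snippet": "..."}


async def deliver_pending(rows: list, ws: FakeWS) -> None:
    """Send one scout_proposal frame per queued row."""
    for row in rows:
        if row.status != "queued":
            continue
        meta = await fetch_metadata(row)
        await ws.send_json({
            "type": "scout_proposal",
            "proposal": {
                "id": row.id,
                "scoutId": row.scout_id,
                "sourceType": "gmail",
                "sourceMsgRef": row.source_msg_ref,
                "rawSubject": meta["subject"],
                "rawSnippet": meta["snippet"],
                "category": "unprocessed",
                "payload": None,
            },
        })


def handle_ack(rows: list, frame: dict) -> bool:
    """Apply a scout_proposal_ack; return False for unknown or repeated acks."""
    for row in rows:
        if row.id == frame["proposalId"] and row.status == "queued":
            row.status = "delivered"
            row.delivered_at = datetime.now(timezone.utc)
            return True
    return False
```

Under this reading, a reconnect that happens before the ack resends the same proposal, so the Electron side would need to treat `proposal.id` as an idempotency key when inserting into `scout_suggestions`.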
## Stage 1 Triage Detail
```
Webhook (Pub/Sub) or cron tick
  -> ScoutEngine.trigger_scout(scout_id)
       -> if device inactive > N days: skip (pause)
       -> connector.list_new(scout) -> [ItemRef]
       -> for each ref:
            - if (scout_id, source_msg_ref) already in queue: skip (idempotent)
            - content = await connector.fetch_content(scout, ref)      # transient
            - verdict = await ScoutEngine._triage_llm(scout, content)  # gpt-4o-mini
            - if verdict == spam:
                - if scout.auto_trash_spam: connector.archive(...)
                - continue with next ref                               # not queued
            - INSERT scout_triage_queue row
       -> UPDATE cloud_scout_configs.last_run_at
       -> INSERT scout_run_logs row
```
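The flow above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `Scout`, `FakeConnector`, and `fake_triage` are hypothetical doubles, a keyword check replaces the LLM call, and a `set` replaces the `scout_triage_queue` table.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Scout:
    id: str
    auto_trash_spam: bool = False
    device_inactive_days: int = 0
    device_inactivity_pause_days: int = 14


@dataclass
class FakeConnector:
    """Hypothetical connector double backed by a dict of message bodies."""

    inbox: dict
    trashed: list = field(default_factory=list)

    async def list_new(self, scout):
        return list(self.inbox)

    async def fetch_content(self, scout, ref):
        return self.inbox[ref]

    async def archive(self, scout, ref):
        self.trashed.append(ref)


async def fake_triage(body: str) -> str:
    # Stand-in for the LLM call; a keyword check keeps the sketch deterministic.
    return "spam" if "lottery" in body.lower() else "relevant"


async def trigger_scout(scout, connector, queue: set, triage=fake_triage) -> int:
    """Run one triage pass; return the number of newly queued items."""
    if scout.device_inactive_days > scout.device_inactivity_pause_days:
        return 0  # paused: the device has been away too long
    queued = 0
    for ref in await connector.list_new(scout):
        key = (scout.id, ref)
        if key in queue:
            continue  # idempotent: already triaged and queued
        body = await connector.fetch_content(scout, ref)  # transient
        verdict = await triage(body)
        del body  # the content is not referenced past this point
        if verdict == "spam":
            if scout.auto_trash_spam:
                await connector.archive(scout, ref)
            continue  # spam is never queued
        queue.add(key)
        queued += 1
    return queued
```

Only the `(scout_id, ref)` key reaches the queue; the body is dropped right after the verdict, which mirrors the zero-trust constraint.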
### Triage LLM contract
- **Prompt name (Langfuse):** `scout-triage-system` — source-agnostic, parameterized by `source_type`.
- **Input:** `{source_type, scout_name, scout_purpose, item_subject, item_sender, item_body_truncated_2k}`.
- **Output (structured, Pydantic `TriageVerdict`):** `{verdict: "relevant" | "spam", reason: str, confidence: float}`.
- **Cost guard:** body truncated at 2k chars before LLM call.
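A minimal sketch of the contract's shapes follows. It uses a plain dataclass as a stand-in for the Pydantic model so that it runs on the standard library alone. The `[0, 1]` range for `confidence` is an assumption (the spec only says `float`), and `build_triage_input` and `BODY_LIMIT` are hypothetical names.

```python
from dataclasses import dataclass

BODY_LIMIT = 2000  # the "2k chars" cost guard from the contract above


@dataclass(frozen=True)
class TriageVerdict:
    """Dataclass stand-in for the Pydantic model named in the contract."""

    verdict: str
    reason: str
    confidence: float

    def __post_init__(self) -> None:
        if self.verdict not in ("relevant", "spam"):
            raise ValueError(f"unknown verdict: {self.verdict!r}")
        # Assumed range; the spec does not state bounds for confidence.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")


def build_triage_input(
    source_type: str,
    scout_name: str,
    scout_purpose: str,
    subject: str,
    sender: str,
    body: str,
) -> dict:
    """Assemble the prompt variables, truncating the body before the LLM call."""
    return {
        "source_type": source_type,
        "scout_name": scout_name,
        "scout_purpose": scout_purpose,
        "item_subject": subject,
        "item_sender": sender,
        "item_body_truncated_2k": body[:BODY_LIMIT],
    }
```

Rejecting unknown verdict strings at construction time keeps a malformed LLM response from silently entering the queue path.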
### Failure modes
- LLM call fails: log error, leave message unprocessed, retry on next webhook/cron.
- Gmail 401 (refresh exhausted): mark scout `enabled=false`, surface re-auth prompt to user via WS frame on next device connect.
- Pub/Sub webhook unverified JWT: 401.
## Gmail Push Setup
- On scout enable: `GmailConnector.setup_watch(scout)` calls `users.watch` against a single project-wide Pub/Sub topic.
- `gmail_watch_expires_at` stored. Watches expire after 7 days.
- Weekly cron `_scout_watch_renewal_tick` re-issues `watch` for any scout whose expiry is within 24h.
- Webhook route: `POST /api/v1/scouts/webhooks/gmail`. Verifies Pub/Sub-signed JWT, resolves user via the email address in the payload, enqueues triage job.
- Cron fallback (`_scout_cron_tick`, runs each scout's `schedule_cron`): polls `users.history.list` since `gmail_history_id`, updates `gmail_history_id` after.
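For the webhook route, the Pub/Sub push envelope has to be unpacked before the user can be resolved. The sketch below assumes the documented Gmail notification shape: a base64-encoded JSON payload under `message.data`, carrying `emailAddress` and `historyId`. `decode_gmail_push` is a hypothetical helper name, and JWT verification is deliberately left out of this sketch.

```python
import base64
import json


def decode_gmail_push(envelope: dict) -> tuple[str, str]:
    """Return (email_address, history_id) from a Pub/Sub push envelope."""
    data = envelope["message"]["data"]
    # Accept both the URL-safe and the standard base64 alphabet,
    # and restore any padding that was stripped in transit.
    data = data.replace("-", "+").replace("_", "/")
    data += "=" * (-len(data) % 4)
    payload = json.loads(base64.b64decode(data))
    # historyId is normalized to str to match the text column gmail_history_id.
    return payload["emailAddress"], str(payload["historyId"])
```

The returned history ID only signals that something changed; the message list itself would still come from `users.history.list`, as in the cron fallback.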
## Terminology Refactor (Detail)
### Renamed
| Surface | Before | After |
|-------------------|-----------------------------------------------------|-----------------------------------------------------|
| Settings nav | `settings.agents` "Agents" | `settings.scouts` "Scouts" |
| Subtitle/desc | `settings.agentsSubtitle`, `agentsDescription` | `settings.scoutsSubtitle`, `scoutsDescription` |
| `agents.*` keys | `noAgentsYet`, `createAgent`, `yourAgents`, etc. | `scouts.noScoutsYet`, `createScout`, `yourScouts` |
| `toast.agent.*` | `created`, `runStarted`, etc. | `toast.scout.*` |
| Components | `AgentsSection`, `AgentRow`, `LocalAgentConfigPanel`, `CloudAgentConfigPanel`, `InlineAgentCreationStepper` | `ScoutsSection`, `ScoutRow`, `LocalScoutConfigPanel`, `CloudScoutConfigPanel`, `InlineScoutCreationStepper` |
| TS types | `LocalAgentConfig`, `CloudAgentConfig` | `LocalScoutConfig`, `CloudScoutConfig` |
| tRPC router | `agent.local`, `agent.cloud`, `agent.journey`, `agent.runs`, `agent.runActions` | `scout.local`, `scout.cloud`, `scout.journey`, `scout.runs`, `scout.runActions` |
| Drizzle tables | `agent_runs`, `agent_run_actions` | `scout_runs`, `scout_run_actions` |
| Main process | `src/main/agents/agent-scheduler.ts` | `src/main/scouts/scout-scheduler.ts` |
| BE routes | `/api/v1/agents/*`, `/api/v1/agent-setup` | `/api/v1/scouts/*`, `/api/v1/scout-setup` |
| BE modules | `routes/agents.py`, `routes/agent_setup.py`, `core/agent_runner.py`, `core/agent_session_buffer.py`, `core/agent_registry.py` | `routes/scouts.py`, `routes/scout_setup.py`, `core/scout_runner.py`, `core/scout_session_buffer.py`, `core/scout_registry.py` |
| Postgres tables | `local_agent_configs`, `cloud_agent_configs`, `agent_run_logs` | `local_scout_configs`, `cloud_scout_configs`, `scout_run_logs` |
| Postgres columns | `agent_config`, `agent_id`, `agent_run_id` | `scout_config`, `scout_id`, `scout_run_id` |
| SQLAlchemy models | `LocalAgentConfig`, `CloudAgentConfig`, `AgentRunLog` | `LocalScoutConfig`, `CloudScoutConfig`, `ScoutRunLog` |
| Langfuse prompts | user-facing scout prompts named `agent-*` | recreate as `scout-*`; delete old |
| i18n | 5 langs (en/it/es/fr/de) | all updated atomically |
### Kept as-is
- `app/agents/*` Python module — these are LLM helper agents (task_agent, project_agent, note_agent, timeline_agent, filesystem_agent) invoked internally by `deep_agent`. Different concept from user-facing scouts. Renaming would create semantic clash with LLM-agent terminology.
- `/api/v1/device` WS endpoint name (already source-neutral).
- All `tool_call`, `run_complete`, etc. WS frame types unrelated to scouts.
## Phasing
### Phase 1 — Rename only
- Single PR. Single Alembic migration. Single Drizzle migration.
- All renames listed above land together. App still works, existing local scout still runs. No new behavior.
### Phase 2 — Connector abstraction skeleton
- New module `app/scouts/connectors/{base,registry,gmail}.py`.
- New module `app/scouts/engine.py`.
- New table `scout_triage_queue` + alterations to `cloud_scout_configs`.
- New SQLite table `scout_suggestions` (Drizzle).
- New WS frame types `scout_proposal` + `scout_proposal_ack`.
- No user-facing change yet.
### Phase 3 — Gmail scout end-to-end
- Settings UI: "Add Gmail scout" → OAuth consent (separate scope set: `gmail.readonly` + `gmail.modify`) → encrypted token stored in `cloud_scout_configs.oauth_token_encrypted` → save scout config.
- Pub/Sub topic + webhook route + JWT verify.
- `setup_watch` on enable; weekly `renew_watch` cron.
- Cron-fallback `_scout_cron_tick` per scout.
- Triage LLM (gpt-4o-mini, Langfuse `scout-triage-system`).
- Spam auto-trash toggle (default off) per scout.
- Device-inactivity pause logic.
- WS deliver-on-reconnect drains queue → `scout_proposal` frames → ack handler → SQLite `scout_suggestions` insert with `category='unprocessed'` (Phase 4 swaps real categorization in).
- "Read full email" tool call: Electron requests body for a suggestion → BE `GmailConnector.fetch_content` → returns body transiently in tool result.
### Phase 4 — Deferred (separate spec, with task-brief rework)
- Stage 2 categorization agent (prompt + tool palette: `list_projects`, `list_tasks`, `search_notes`, memory).
- HITL UI surface in the brief: suggestion cards, approve/reject controls, "convert to task | event | note | project | actionable-only" actions.
- `list_pending_scout_suggestions` brief tool.
- Convert-to-entity mutations.
- Future connectors (Slack/Teams/Outlook/...).
## Testing Surface
- **Phase 1:** existing pytest suite still green with renamed identifiers (auth, ws_unified, schemas, models, etc.). UI smoke: settings page renders, existing local scout runs.
- **Phase 2:** unit tests for `ScoutEngine` w/ mocked `SourceConnector`. Idempotency test (replay same `source_msg_ref`).
- **Phase 3:** integration tests for Gmail webhook → triage → queue insertion (mocked `GmailConnector` for content fetch and LLM). E2E (manual): connect a real Gmail account on dev, send an email, observe queue row appear, reconnect device, observe `scout_suggestions` row land with subject/snippet.
## Open Questions (none blocking)
- OAuth-token encryption key derivation (app-global vs user-derived): handled by an investigation step in the implementation plan that documents the current state. Security hardening is out of scope.
- Pub/Sub topic naming and IAM setup (one topic project-wide vs per-environment) — operational detail to decide during Phase 3.
## Risks
- Pub/Sub setup is per-Google-Cloud-project and requires console IAM grants — first-time setup friction.
- Gmail `users.watch` quota: 1 watch per user. We use one watch per scout, but a user has only one Gmail scout per Gmail account, so this is fine.
- `_pending_states` dict pattern in existing OAuth flow is in-memory — Pub/Sub webhook can run on any worker, so any cross-request state must be in DB, not in-memory. This design uses no in-memory state; safe.
## Acceptance
- All renames land atomically; app boots; existing local scout still operates.
- A user can connect Gmail through the Scouts settings page, see the scout marked enabled, send themselves a test email, and observe a `scout_suggestions` row appear in their local DB with `category='unprocessed'`, `rawSubject`, and `rawSnippet` populated, after the next WS reconnect.
- Spam emails (per LLM triage) are not queued; if `auto_trash_spam=true` they appear in Gmail Trash.
- BE never persists email bodies. Verified by code review of triage flow + grep for `body_text` writes.