# RFC: Slack Delivery Self-Healing for Orc Agents

Date: 2026-07-03
Status: Draft
Owner: codex-orchestrator

## Summary

Make the Orc Slack bridge behave like a reliable delivery layer, not a second conversational agent.

When a human replies in a bound Slack thread, Orc should:

1. Check whether the bound local agent session can actually receive and answer the message.
2. Heal common session failures automatically.
3. Forward the original message to the agent once healthy.
4. Stay quiet when delivery is normal.
5. Use short, user-scoped Slack cues only when delivery is delayed, healed, or failed.

The goal is simple: the user should be able to talk to the agent through Slack without memorizing commands or understanding tmux, MCP, stale binaries, pending rows, or bridge internals.

## Incident That Motivated This

In `#orc-videobrainstorm`, thread `1783053458.149659` was bound to `videobrainstorm-2-add-virality-ideas`. The thread degraded into repeated bridge warnings:

```text
I sent this to `videobrainstorm-2-add-virality-ideas`, but no Slack reply has come back yet...
```

Investigation showed two distinct failure modes in the live system at the time:

- **Delivery health was not checked before forwarding.** The bound agent's `orc mcp` process was running a deleted/stale `target/release/orc` binary after the live Orc binary was rebuilt.
- **Pending replies were tracked per event, not per thread delivery state.** Every forwarded human message created a fresh watchdog timer. When the user wrote "stop sending him messages" and "Don't", Orc forwarded those messages as prompts too, instead of healing or pausing the unhealthy delivery path.

This should not require the user to know special commands. If the session is broken, Orc should detect that, heal it, and then send the message. If it cannot heal, it should explain exactly what failed, and do so only once.

### Alignment With Current `origin/main`

After this incident, `origin/main` already changed two relevant pieces:

- PR #13 (`f617d88`) changed `drain_due_agent_reply_warnings` so stale `pending_agent_replies` are cleared without posting noisy Slack warnings.
- PR #14 (`b23aace`) added hot-loadable MCP tools: long-running MCP clients can use `orc_tool_manifest`, `orc_tool_call`, and `orc_tool_reload_status` to reach tools from the currently installed Orc binary when their original tool list is stale.

Those changes reduce the immediate warning spam and tool-list staleness, but they do not provide delivery preflight or automatic healing. The remaining gap is: before forwarding a Slack message, Orc still needs to prove the bound session can receive/respond, repair it when possible, then deliver the original message exactly once.

## Product Principle

Slack should feel like direct conversation with the agent.

Orc's bridge is allowed to speak only as transport/control infrastructure:

- "I restarted the stale session and sent your message."
- "I couldn't deliver this because the worktree is missing."
- "Still working; last agent signal was 2m ago."

Orc should not become a second personality in front of the agent. Normal messages should route through without commentary.

## Slack Research: Ephemeral Messages

Slack supports `chat.postEphemeral` for messages visible only to one user in a channel. It requires `channel` and `user`, accepts Block Kit `blocks`, and also accepts `thread_ts`; Slack notes that thread ephemerals only appear when there is already an active thread.

Relevant official Slack docs:

- `chat.postEphemeral`: https://docs.slack.dev/reference/methods/chat.postEphemeral/
- Messaging overview: https://docs.slack.dev/messaging/
- Slack AI agent interaction guidance: https://docs.slack.dev/ai/agent-entry-and-interaction/

Important constraints from Slack docs:

- Ephemeral delivery is **not guaranteed**. The target user must be active in Slack and a member of the channel.
- Ephemeral messages do **not persist** across reloads, apps, or sessions.
- Ephemeral messages cannot be retrieved via APIs and cannot be updated through normal `chat.update`.
- Slack recommends ephemerals for context-sensitive messages in response to user action, not unsolicited background notifications.

Conclusion:

- Use ephemeral messages for transient delivery cues: "Checking session...", "Restarted stale agent and sent your message", "Delivery failed; click or reply to retry."
- Do **not** use ephemeral messages as the durable state store.
- Keep durable delivery state in SQLite and expose it through `status`, `orc show`, and normal thread replies when the state matters to everyone in the thread.
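
As a rough illustration of the request shape, the sketch below collects the `chat.postEphemeral` arguments mentioned above (`channel`, `user`, `text`, and the optional `thread_ts`) into key/value pairs. The `EphemeralRequest` type and `to_params` helper are hypothetical names, not existing Orc code; HTTP transport, encoding, and token handling are left to the real client.

```rust
/// Arguments for Slack's `chat.postEphemeral` (hypothetical wrapper type).
#[derive(Debug, Clone)]
pub struct EphemeralRequest {
    pub channel: String,
    pub user: String,
    pub text: String,
    /// Set only when replying inside an existing thread.
    pub thread_ts: Option<String>,
}

impl EphemeralRequest {
    /// Flatten into key/value pairs; the HTTP client is expected to encode them.
    pub fn to_params(&self) -> Vec<(&'static str, String)> {
        let mut params = vec![
            ("channel", self.channel.clone()),
            ("user", self.user.clone()),
            ("text", self.text.clone()),
        ];
        if let Some(ts) = &self.thread_ts {
            params.push(("thread_ts", ts.clone()));
        }
        params
    }
}
```

Because Slack does not guarantee ephemeral delivery, a caller would still record the outcome in SQLite rather than treating a successful call as proof the user saw the cue.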

## Proposed User Experience

### Healthy Delivery

Human replies in a bound thread:

```text
Can you make the landing page idea sharper?
```

Orc behavior:

- Run delivery preflight.
- If healthy, forward to agent.
- If the operation is slow, optionally add an `eyes` reaction or an ephemeral "Sent to agent".
- No public Orc reply.

### Self-Healed Delivery

Human replies while the agent's MCP process is stale:

```text
Can you make the landing page idea sharper?
```

Orc behavior:

- Detect stale/deleted MCP binary.
- Restart or resume the agent in the bound session.
- Forward the original message.
- Ephemeral to the requesting user:

```text
Session was stale; I restarted it and sent your message.
```

If ephemerals fail or the state affects other collaborators, fall back to one thread reply.

### Failed Delivery

Human replies while the worktree or session cannot be repaired:

```text
Can you continue?
```

Orc behavior:

- Try the configured heal path.
- If still broken, try to repair the user's path instead of asking them to know Orc internals:
  - Restart the bound session when the worktree is valid.
  - Recreate a missing tmux session for the bound repo/worktree.
  - If the bound worktree is gone but the repo is known, offer/create a replacement worktree session and move the Slack thread binding to it.
  - If more than one safe replacement exists, ask one clarifying question with buttons rather than explaining command syntax.
- If no safe repair or rebind path exists, do not keep retrying or warning.
- Post one visible message with action buttons:

```text
I couldn't deliver this to `videobrainstorm-2-add-virality-ideas` yet.
Reason: worktree `/home/dev/dev/worktrees/videobrainstorm-2/add-virality-ideas` is missing.
I can fix that for you.
```

Primary actions:

- `Create replacement session`
- `Rebind to existing session`
- `Dismiss`

### User Asks What Is Happening

Human writes:

```text
what is happening here?
```

Orc should answer with delivery state rather than forwarding the text to a broken agent:

```text
This thread is bound to `videobrainstorm-2-add-virality-ideas`.
State: healing
Last delivery: stale MCP detected; restarting agent
Last agent signal: 4m ago
Next: I will send your latest message after restart.
```

This can be ephemeral to the asker unless the state is a shared failure.

## Delivery State Model

Add a durable thread delivery state keyed by `(channel_id, thread_ts, session)`.

Suggested table:

```sql
CREATE TABLE slack_thread_delivery_state (
  channel_id TEXT NOT NULL,
  thread_ts TEXT NOT NULL,
  session TEXT NOT NULL,
  state TEXT NOT NULL,
  reason TEXT,
  last_user_event_id TEXT,
  last_user_ts TEXT,
  last_agent_signal TEXT,
  last_agent_signal_at INTEGER,
  last_delivery_attempt_at INTEGER,
  last_heal_attempt_at INTEGER,
  heal_attempt_count INTEGER NOT NULL DEFAULT 0,
  updated_at INTEGER NOT NULL,
  PRIMARY KEY (channel_id, thread_ts, session)
);
```

States:

- `active`: healthy; forward normally.
- `checking`: preflight in progress.
- `healing`: repair in progress.
- `delivered`: latest message was delivered to agent.
- `waiting`: delivered and waiting for agent signal.
- `failed`: delivery failed; do not keep retrying same event.

This complements the current `origin/main` behavior where due `pending_agent_replies` are cleared without posting noisy Slack warnings. The remaining need is a durable per-thread delivery state so the user can see whether the latest Slack message was delivered, healing, or failed. `pending_agent_replies` can remain as an internal timeout marker, but user-visible state should be keyed by thread/session.
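
A minimal sketch of how the states above could map onto a Rust type. The state names come from the list; the transition edges in `can_transition_to` are an assumption, since this RFC names the states but does not yet fix which moves are legal.

```rust
/// Delivery states from the list above, as stored in the `state` column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeliveryState {
    Active,
    Checking,
    Healing,
    Delivered,
    Waiting,
    Failed,
}

impl DeliveryState {
    /// Text form written to SQLite.
    pub fn as_str(self) -> &'static str {
        match self {
            DeliveryState::Active => "active",
            DeliveryState::Checking => "checking",
            DeliveryState::Healing => "healing",
            DeliveryState::Delivered => "delivered",
            DeliveryState::Waiting => "waiting",
            DeliveryState::Failed => "failed",
        }
    }

    /// Parse the stored text; unknown values yield `None`.
    pub fn parse(s: &str) -> Option<DeliveryState> {
        match s {
            "active" => Some(DeliveryState::Active),
            "checking" => Some(DeliveryState::Checking),
            "healing" => Some(DeliveryState::Healing),
            "delivered" => Some(DeliveryState::Delivered),
            "waiting" => Some(DeliveryState::Waiting),
            "failed" => Some(DeliveryState::Failed),
            _ => None,
        }
    }

    /// ASSUMPTION: allowed edges. A new user message re-enters `Checking`
    /// from any settled state; `Failed` never jumps straight to `Delivered`.
    pub fn can_transition_to(self, next: DeliveryState) -> bool {
        use DeliveryState::*;
        matches!(
            (self, next),
            (Active, Checking)
                | (Checking, Healing)
                | (Checking, Delivered)
                | (Checking, Failed)
                | (Healing, Delivered)
                | (Healing, Failed)
                | (Delivered, Waiting)
                | (Waiting, Active)
                | (Waiting, Failed)
                | (Delivered, Checking)
                | (Waiting, Checking)
                | (Failed, Checking)
        )
    }
}
```

Validating transitions in one place would let the store helpers reject an update such as `failed` to `delivered` without a fresh preflight.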

## Health Preflight

Before forwarding a Slack thread prompt, Orc should evaluate:

1. **Binding exists**
   - There is a `slack-thread` binding for `(channel_id, thread_ts)`.
   - Binding has a session.

2. **Session exists**
   - `discover()` can find the tmux session by exact name or session id.
   - The session still maps to the expected repo/worktree when binding has repo metadata.

3. **Pane can receive input**
   - At least one live pane exists.
   - If pane is at a shell prompt and `agent.auto_start_slack_created_sessions` is enabled, start the configured agent.
   - If pane is dead or target ambiguous, repair or fail clearly.

4. **Agent process is usable**
   - The pane has a Codex/Claude process, or can be started.
   - The process is not obviously stuck in a shell fallback.

5. **MCP is current enough**
   - Detect child `orc mcp` process for the agent where possible.
   - Flag `/proc/<pid>/exe -> ... (deleted)` as stale.
   - Compare the MCP executable mtime/build id to the live `target/release/orc` when available.
   - Check `orc_tool_reload_status` / dynamic dispatch availability where possible. A long-running MCP server can be acceptable if it can call the current installed Orc binary through the hot-load path.

6. **Slack reply path exists**
   - Bound thread is valid.
   - Bot token present for bridge process.
   - Agent prompt envelope includes fallback CLI commands.
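
The checks above could be evaluated in order, stopping at the first failure. The sketch below assumes the facts have already been gathered into a hypothetical `SessionFacts` struct; `PreflightFailure`, `preflight`, and `exe_target_is_deleted` are illustrative names. It simplifies items 3 and 4 (no auto-start handling) and treats a stale MCP as acceptable when the hot-load path is available, as item 5 allows.

```rust
/// First failing check, in the order listed above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PreflightFailure {
    NoBinding,
    SessionMissing,
    NoLivePane,
    AgentNotRunning,
    StaleMcp,
    NoReplyPath,
}

/// Facts gathered about a bound session before forwarding (hypothetical shape).
#[derive(Debug, Clone, Default)]
pub struct SessionFacts {
    pub has_binding: bool,
    pub session_found: bool,
    pub live_panes: usize,
    pub agent_running: bool,
    /// Link target of `/proc/<pid>/exe` for the agent's `orc mcp` child, if found.
    pub mcp_exe_target: Option<String>,
    /// True when the hot-load path can reach the currently installed binary.
    pub hot_load_available: bool,
    pub bot_token_present: bool,
}

/// Linux appends " (deleted)" to the exe link target once the file is unlinked.
pub fn exe_target_is_deleted(target: &str) -> bool {
    target.ends_with(" (deleted)")
}

/// Run the checks in order and report the first failure.
pub fn preflight(f: &SessionFacts) -> Result<(), PreflightFailure> {
    if !f.has_binding {
        return Err(PreflightFailure::NoBinding);
    }
    if !f.session_found {
        return Err(PreflightFailure::SessionMissing);
    }
    if f.live_panes == 0 {
        return Err(PreflightFailure::NoLivePane);
    }
    if !f.agent_running {
        return Err(PreflightFailure::AgentNotRunning);
    }
    let stale = f.mcp_exe_target.as_deref().map_or(false, exe_target_is_deleted);
    if stale && !f.hot_load_available {
        return Err(PreflightFailure::StaleMcp);
    }
    if !f.bot_token_present {
        return Err(PreflightFailure::NoReplyPath);
    }
    Ok(())
}
```

Returning a single typed failure keeps the preflight read-only; deciding whether and how to repair stays with the healing step.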

## Healing Actions

Healing should be conservative and scoped to the bound worktree/session:

- **Shell-only pane:** start configured agent command in the existing pane.
- **Stale/deleted MCP:** restart the agent process in the same session so it launches a current `orc mcp`.
- **Dead pane/session missing:** recreate the tmux session for the bound repo/worktree if the worktree exists and is safe.
- **Worktree missing but repo known:** create a replacement task worktree/session when the task name is recoverable, or present candidate existing sessions for one-click rebind.
- **Worktree missing and repo unknown:** fail with one actionable message and a `Choose repo` path when Slack channel binding or thread history can infer candidates.
- **Repo mismatch:** fail closed; do not send prompt.
- **Dirty worktree:** do not block delivery; include the dirty state in diagnostic status when relevant.

After a successful heal, Orc forwards the original Slack message exactly once.
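
One way to keep healing conservative is to make the problem-to-action mapping a single exhaustive `match`, and to guard forwarding with a ledger of event ids. The sketch below is illustrative only: `Problem`, `HealAction`, `choose_heal`, and `ForwardLedger` are hypothetical names, and a real ledger would live in SQLite rather than in memory.

```rust
use std::collections::HashSet;

/// Problems from the list above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Problem {
    ShellOnlyPane,
    StaleMcp,
    SessionMissing { worktree_exists: bool, repo_known: bool },
    RepoMismatch,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealAction {
    StartAgentInPane,
    RestartAgent,
    RecreateTmuxSession,
    OfferReplacementOrRebind,
    FailWithChooseRepo,
    FailClosed,
}

/// Map each problem to the conservative action described above.
pub fn choose_heal(problem: Problem) -> HealAction {
    match problem {
        Problem::ShellOnlyPane => HealAction::StartAgentInPane,
        Problem::StaleMcp => HealAction::RestartAgent,
        Problem::SessionMissing { worktree_exists: true, .. } => HealAction::RecreateTmuxSession,
        Problem::SessionMissing { worktree_exists: false, repo_known: true } => {
            HealAction::OfferReplacementOrRebind
        }
        Problem::SessionMissing { worktree_exists: false, repo_known: false } => {
            HealAction::FailWithChooseRepo
        }
        Problem::RepoMismatch => HealAction::FailClosed,
    }
}

/// In-memory stand-in for a durable record of forwarded Slack events.
#[derive(Debug, Default)]
pub struct ForwardLedger {
    forwarded: HashSet<String>,
}

impl ForwardLedger {
    /// Returns true only the first time an event id is seen.
    pub fn mark_forwarded(&mut self, event_id: &str) -> bool {
        self.forwarded.insert(event_id.to_string())
    }
}
```

A caller would forward the original message only when `mark_forwarded` returns `true`, so a retry after healing cannot deliver the same event twice.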

## Slack Message Surfaces

Use three surfaces deliberately:

### Ephemeral

Use for private, transient delivery feedback to the requester:

- "Checking session..."
- "Restarted stale agent and sent your message."
- "Still healing; retrying once."

Do not rely on ephemerals for durable state because Slack does not guarantee delivery or persistence.

### Thread Reply

Use when the state matters to everyone in the thread or delivery failed:

- "I couldn't deliver this..."
- "This thread is no longer bound..."
- "I restarted the stale session and sent the message." only when ephemeral failed or a shared interruption occurred.

### Reactions

Use as lightweight signals:

- `eyes`: accepted/checking.
- Avoid automatic `white_check_mark` from the bridge unless delivery actually completed or the agent explicitly signaled done.
- Agent reactions should count as agent signals and clear/update delivery state.
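
The surface rules can be condensed into one small decision function. This is a sketch under the rules stated in this RFC (silent when healthy, ephemeral for healed delivery, thread reply for failures or when the ephemeral fails or others are affected); `DeliveryOutcome`, `Surface`, and `choose_surface` are hypothetical names, and reactions are left out for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeliveryOutcome {
    Normal,
    Healed,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Surface {
    /// Say nothing; the message simply routes through.
    Silent,
    Ephemeral,
    ThreadReply,
}

/// Pick where, if anywhere, the bridge should speak.
pub fn choose_surface(
    outcome: DeliveryOutcome,
    affects_others: bool,
    ephemeral_failed: bool,
) -> Surface {
    match outcome {
        DeliveryOutcome::Normal => Surface::Silent,
        DeliveryOutcome::Failed => Surface::ThreadReply,
        DeliveryOutcome::Healed if affects_others || ephemeral_failed => Surface::ThreadReply,
        DeliveryOutcome::Healed => Surface::Ephemeral,
    }
}
```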

## Today Scope: Personal Slack Follow-Up Nudges

The same bridge should support a separate personal-assistant surface today: Orc notices Slack messages Marvin likely owes a reply to, or messages where Marvin asked for something and nobody replied within 2 hours.

This should not be implemented as "send every Slack message to an LLM." It should be deterministic first, with an LLM only as a narrow gate over candidate snippets.

### Product Goal

The user should be able to write naturally in Slack and still see lightweight follow-up prompts when a thread needs attention:

- "You may owe Sarah a reply in `#launch`: she asked whether to ship the copy today."
- "No one replied to your question in `#design` after 2h."

Each nudge should include at most three primary actions:

- `Draft reply`
- `Mark done`
- `Remind later`

The Slack thread permalink should be part of the message text, not a fourth button. Ignore/snooze variants can live behind `Remind later` or natural-language replies such as "ignore this thread."

### Scope and Privacy

Default behavior should be off until explicitly enabled.

Start with an allowlist:

- Slack workspace id.
- Channel ids and DM/MPIM scopes that Marvin opts into.
- Optional quiet hours and muted channel list.

Store bounded metadata by default, not a permanent raw Slack archive:

- `channel_id`, `thread_ts`, `message_ts`, sender id, last actor id.
- message permalink where available.
- short excerpt and/or text hash.
- candidate kind, confidence, state, next nudge time.

Raw text should be retained only while a candidate is active, or behind an explicit retention setting. Do not write raw private Slack contents into repo docs, wiki, logs, or prompts to unrelated agents.

Likely Slack app requirements:

- History scopes for opted-in surfaces: public channels, private channels, DMs, and group DMs as needed.
- `chat:write` for personal nudges.
- Slack interactivity enabled for button actions. Orc can handle action payloads through the existing bridge ingress/socket-mode path rather than requiring the user to type command keywords.

### Deterministic Candidate Detector

Maintain Slack cursors per conversation and scan only new messages.

Candidate kinds:

- `needs_my_reply`: someone else asked Marvin a direct question, mentioned Marvin, DM'd Marvin, or replied in a thread where Marvin is the likely owner and Marvin has not replied since.
- `awaiting_their_reply`: Marvin asked a question or made a request and no non-Marvin human replied after 2 hours.

Deterministic filters:

- Ignore bot messages unless the bot is explicitly allowlisted.
- Ignore archived, muted, or non-allowlisted conversations.
- Ignore messages younger than the configured age threshold.
- Ignore threads where Marvin replied after the candidate message.
- Ignore candidates already marked done, ignored, or snoozed.
- Treat relevant reactions such as done/acknowledged as candidate-closing signals when configured.
- Collapse multiple messages in one thread into one candidate with the latest relevant context.
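
The filters above could be applied as a single pass that names the reason a candidate was skipped. The sketch below is illustrative: `ThreadSnapshot`, `Skip`, and `evaluate` are hypothetical names, timestamps are simplified to whole Unix seconds rather than Slack's string `ts` values, and reaction-based closing and thread collapsing are omitted.

```rust
/// Minimal view of one candidate message (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ThreadSnapshot {
    pub from_bot: bool,
    pub bot_allowlisted: bool,
    pub conversation_allowlisted: bool,
    pub muted_or_archived: bool,
    /// Candidate message time, Unix seconds (simplified from Slack `ts`).
    pub message_ts: i64,
    /// Most recent reply by the owner in this thread, if any.
    pub last_owner_reply_ts: Option<i64>,
    /// Already marked done, ignored, or snoozed.
    pub already_closed: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Skip {
    Bot,
    ConversationExcluded,
    TooNew,
    OwnerReplied,
    AlreadyClosed,
}

/// Ok(()) means the message survives every deterministic filter.
pub fn evaluate(s: &ThreadSnapshot, now: i64, threshold_secs: i64) -> Result<(), Skip> {
    if s.from_bot && !s.bot_allowlisted {
        return Err(Skip::Bot);
    }
    if !s.conversation_allowlisted || s.muted_or_archived {
        return Err(Skip::ConversationExcluded);
    }
    if now - s.message_ts < threshold_secs {
        return Err(Skip::TooNew);
    }
    if let Some(reply_ts) = s.last_owner_reply_ts {
        if reply_ts > s.message_ts {
            return Err(Skip::OwnerReplied);
        }
    }
    if s.already_closed {
        return Err(Skip::AlreadyClosed);
    }
    Ok(())
}
```

Recording the `Skip` reason alongside `deterministic_reason` would make it easier to explain later why a thread did or did not produce a nudge.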

Suggested tables:

```sql
CREATE TABLE slack_message_cursors (
  conversation_id TEXT PRIMARY KEY,
  last_ts TEXT NOT NULL,
  updated_at INTEGER NOT NULL
);

CREATE TABLE slack_followup_candidates (
  id TEXT PRIMARY KEY,
  kind TEXT NOT NULL,
  conversation_id TEXT NOT NULL,
  thread_ts TEXT NOT NULL,
  message_ts TEXT NOT NULL,
  user_id TEXT NOT NULL,
  excerpt TEXT,
  text_hash TEXT,
  state TEXT NOT NULL,
  deterministic_reason TEXT NOT NULL,
  llm_confidence REAL,
  llm_reason TEXT,
  next_nudge_at INTEGER,
  created_at INTEGER NOT NULL,
  updated_at INTEGER NOT NULL
);

CREATE TABLE slack_followup_actions (
  id TEXT PRIMARY KEY,
  candidate_id TEXT NOT NULL,
  action TEXT NOT NULL,
  slack_action_ts TEXT,
  created_at INTEGER NOT NULL
);
```

### LLM Gate

Only pass a small candidate packet to the model:

- Conversation type.
- Latest relevant excerpt(s).
- Who wrote last.
- Whether Marvin was mentioned or authored the earlier request.
- Thread timing and deterministic reason.

Expected model output:

```json
{
  "requires_nudge": true,
  "kind": "needs_my_reply",
  "confidence": 0.84,
  "reason": "Direct question to Marvin with no later Marvin reply",
  "suggested_action": "Reply"
}
```

The LLM may suppress noisy candidates, but it must not create candidates from arbitrary history. Low-confidence candidates stay silent or go to a digest rather than producing immediate nudges.
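
A sketch of how the gate result might be turned into a decision. It assumes the model output has already been parsed into a hypothetical `GateOutput`; the `decide` function, the `NudgeDecision` type, and the confidence cutoff passed in by the caller are all illustrative. Routing low-confidence candidates to a digest (rather than dropping them) is one of the two options named above, chosen here only for the example.

```rust
/// Parsed subset of the model output shown above (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct GateOutput {
    pub requires_nudge: bool,
    pub confidence: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NudgeDecision {
    NudgeNow,
    Digest,
    Suppress,
}

/// The gate can only suppress or defer a deterministic candidate.
/// A model error is treated as "do not nudge".
pub fn decide(gate: Result<GateOutput, String>, min_confidence: f64) -> NudgeDecision {
    match gate {
        Err(_) => NudgeDecision::Suppress,
        Ok(out) if !out.requires_nudge => NudgeDecision::Suppress,
        Ok(out) if out.confidence >= min_confidence => NudgeDecision::NudgeNow,
        Ok(_) => NudgeDecision::Digest,
    }
}
```

Note that `decide` takes only the gate result for a candidate the detector already produced, so the model has no path to introduce new candidates.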

### Nudge Surface

For 2-hour follow-ups, prefer a DM or App Home-style personal surface over channel ephemerals:

- Ephemeral messages are useful for direct responses to recent user actions, but Slack does not guarantee persistence and they are easy to miss for delayed reminders.
- A DM/App Home nudge can be durable, private, and action-oriented.
- Channel/thread ephemerals can still be used for immediate acknowledgement after a button click, such as "Snoozed until tomorrow."

Each nudge should include three Slack Block Kit buttons with stable action ids. Button handling should update candidate state and, where useful, route a natural-language instruction to the relevant bound agent:

- `Draft reply`: ask Orc to draft a reply from the thread context and post it only after user approval.
- `Mark done`: close the candidate.
- `Remind later`: default to 1 hour; support natural-language follow-up like "tomorrow" or "ignore this thread" from the nudge thread.

The nudge text should include a direct Slack permalink so "open thread" is still one click without consuming a primary action slot.
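
Button handling can stay deterministic: map a stable action id to an action, then update candidate state. In the sketch below the action id strings, the `CandidateState` variants, and the helper names are assumptions for illustration; only the three actions and the 1-hour default for `Remind later` come from this section.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ButtonAction {
    DraftReply,
    MarkDone,
    RemindLater,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CandidateState {
    Open,
    Drafting,
    Done,
    Snoozed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub state: CandidateState,
    /// Unix seconds of the next nudge, if one is scheduled.
    pub next_nudge_at: Option<i64>,
}

/// Default snooze for `Remind later`: 1 hour.
pub const DEFAULT_SNOOZE_SECS: i64 = 3600;

/// ASSUMPTION: these action id strings are placeholders, not a fixed contract.
pub fn parse_action_id(id: &str) -> Option<ButtonAction> {
    match id {
        "followup_draft_reply" => Some(ButtonAction::DraftReply),
        "followup_mark_done" => Some(ButtonAction::MarkDone),
        "followup_remind_later" => Some(ButtonAction::RemindLater),
        _ => None,
    }
}

/// Apply a button press to the candidate row.
pub fn apply_action(c: &mut Candidate, action: ButtonAction, now: i64) {
    match action {
        // Drafting waits for user approval before anything is posted.
        ButtonAction::DraftReply => c.state = CandidateState::Drafting,
        ButtonAction::MarkDone => {
            c.state = CandidateState::Done;
            c.next_nudge_at = None;
        }
        ButtonAction::RemindLater => {
            c.state = CandidateState::Snoozed;
            c.next_nudge_at = Some(now + DEFAULT_SNOOZE_SECS);
        }
    }
}
```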

### Implementation Track

This is related to Slack usability and should be built today as its own stack. Delivery self-healing remains the prerequisite for routing messages to bound agents, but the follow-up nudge stack can start in parallel because its first two PRs are mostly storage, Slack history ingestion, and candidate detection.

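PRs, listed by concern (the PR Plan below numbers this stack PR 8 through PR 11, lands nudge delivery before the LLM gate, and folds natural-language navigation into the LLM gate PR):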

1. **Follow-up settings and cursors**
   - Add opt-in config, conversation allowlist, cursor storage, and read-only history ingestion.
   - Verify no raw Slack text is written to logs/wiki by default.

2. **Deterministic 2-hour candidate detector**
   - Implement `needs_my_reply` and `awaiting_their_reply`.
   - Add tests for replies, thread ownership, snooze/dismiss, reactions, bot filtering, and channel mute filtering.

3. **LLM gate**
   - Add a minimal candidate packet and structured model output.
   - Gate only preselected candidates.
   - Add a fallback for model errors: when the gate call fails or returns an unusable result, do not nudge.

4. **Nudge delivery and buttons**
   - Send personal DM/App Home nudges with exactly three primary buttons: `Draft reply`, `Mark done`, and `Remind later`.
   - Handle button payloads through the bridge.
   - Record action state and expose `orc slack followups status`.

5. **Natural-language navigation**
   - Let Marvin ask "what did I miss?", "show pending replies", "snooze launch followups", or "draft a reply to Sarah" without keyword command memorization.
   - Keep deterministic actions available under the hood, but make Slack conversation the primary interface.

Today build order:

1. Land settings/cursors and deterministic candidate storage.
2. Add the 2-hour detector and tests against Slack fixture payloads.
3. Add nudge delivery with the three-button surface.
4. Add the LLM suppression gate behind a feature flag once deterministic behavior is visible.
5. Add natural-language follow-up commands after the first nudge loop works end-to-end.

## Required Code Changes

Likely files:

- `src/bridge_store.rs`
  - Add delivery state table and migration.
  - Add upsert/get/update helpers.
  - Keep existing pending-reply cleanup behavior, but connect it to delivery state transitions.

- `src/slack_bridge.rs`
  - Add `SlackWebApiClient::post_ephemeral`.
  - Add delivery preflight before `actions.send_prompt`.
  - Add heal path before prompt forwarding.
  - Add Slack interactive handling for delivery repair buttons.
  - Change the pending-reply timeout path to mark delivery state failed/unhealthy once.
  - Record agent signals from Slack MCP replies/reactions.
  - Expand status text to show delivery state.

- `src/main.rs`
  - Add CLI/operator visibility in `orc show` and `orc ls` where useful.
  - Detect stale/deleted `orc mcp` child processes for session detail.
  - Make unavailable context usage explain the actual source, for example "no visible Codex context meter found in captured pane status," rather than implying a missing database.

- `src/mcp_server.rs`
  - Make `orc_slack_reply`, `orc_slack_reply_blocks`, and `orc_slack_react` update delivery state.
  - Treat reactions as meaningful agent signals.

- `docs/runbooks/orc-bridge.md`
  - Document the self-healing delivery model and Slack surfaces.

- `docs/wiki/repos/codex-orchestrator.md`
  - Record durable bridge responsibility after implementation.

Today follow-up nudge files:

- `src/slack_followups.rs`
  - Conversation cursors, candidate detection, LLM gate calls, and nudge state transitions.

- `src/slack_bridge.rs`
  - Slack history ingestion entrypoints and interactive button handling.

- `src/bridge_store.rs`
  - Follow-up cursor/candidate/action tables and migrations.

- `docs/runbooks/orc-bridge.md`
  - Opt-in setup, required Slack app permissions, privacy boundaries, and operator troubleshooting.

## PR Plan

### PR 1: Delivery State Storage and Status Plumbing

Scope:

- Add `slack_thread_delivery_state` table.
- Add store helpers and tests.
- Add status rendering that includes delivery state when available.
- No behavior change to routing yet.

Testing:

```bash
cargo fmt --check
cargo test --lib bridge_store::tests
cargo test --bin orc slack_status
```

Rollback safety:

- Additive SQLite table.
- Existing routing and replies continue to work if reverted.

### PR 2: Session Health Preflight

Scope:

- Add a health model for bound Slack sessions.
- Check session exists, pane live, agent process present, and stale/deleted MCP process.
- Surface health in `orc show <session>` and Slack `status`.
- No auto-restart yet.

Testing:

```bash
cargo fmt --check
cargo test --bin orc
cargo test --lib
/home/dev/.local/bin/orc show videobrainstorm-2-add-virality-ideas --lines 0
```

Rollback safety:

- Read-only diagnostics.
- Does not mutate sessions.

### PR 3: Ephemeral Delivery Cues

Scope:

- Add `SlackWebApiClient::post_ephemeral`.
- Add bridge helper for user-scoped ephemeral messages.
- Use ephemeral only for delivery status generated directly from a user action.
- Fall back to ordinary thread reply only for failures or when ephemeral posting fails.

Testing:

```bash
cargo fmt --check
cargo test --lib slack_bridge
cargo test --bin orc
```

Live smoke:

```bash
set -a; . ~/.config/orc/bridge.env; set +a
orc bridge slack-daemon --once --json
```

Rollback safety:

- Ephemeral support is optional; failures fall back to current messaging.

### PR 4: Auto-Heal Before Forwarding

Scope:

- Insert preflight before `ThreadPrompt` forwarding.
- If unhealthy but repairable, restart/resume the local agent, recreate the missing tmux session, or create/rebind a replacement session, then send the original Slack message once.
- Update delivery state to `checking`, `healing`, `delivered`, or `failed`.
- Post one short ephemeral/user-visible note only when healing happened, and one visible repair choice when automatic repair needs user choice.
- Keep the failed-delivery action surface to three buttons: `Create replacement session`, `Rebind to existing session`, and `Dismiss`.

Testing:

```bash
cargo fmt --check
cargo test --lib
cargo test --bin orc
scripts/slack-bridge-self-test --send-probe --session codex-orchestrator-smoke-test
```

Manual smoke:

1. Bind a Slack thread to a test session.
2. Stop the agent process but leave tmux pane alive.
3. Reply in Slack.
4. Verify Orc restarts the agent and forwards the original message exactly once.
5. Remove a test session/worktree binding and verify Orc can create a replacement or offer rebind candidates without requiring typed commands.

Rollback safety:

- Behavior is scoped to bound Slack threads.
- If healing fails, the message is not forwarded and the failure state is visible.

### PR 5: Delivery Timeouts and Agent Signals

Scope:

- Keep `origin/main`'s no-noise pending cleanup behavior.
- Convert pending delivery timeouts into one delivery-state transition per thread/session.
- Make `orc_slack_reply`, `orc_slack_reply_blocks`, and `orc_slack_react` record agent signals.
- Clear pending delivery on reply/blocks and mark appropriate state on reactions.
- Ensure repeated user messages while unhealthy do not create repeated public failures.

Testing:

```bash
cargo fmt --check
cargo test --lib slack_bridge::tests::slack_pending_agent_replies_are_cleared_without_posting_noise
cargo test --bin orc
```

New regression test:

- Forward three messages to one unhealthy thread.
- Drain pending delivery timeouts.
- Assert one delivery-state failure transition, not three public messages.
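
The regression above hinges on counting transitions per thread rather than per event. A minimal in-memory sketch of that idea follows; `DeliveryTracker` and `record_timeout` are hypothetical names, and the real implementation would read and write `slack_thread_delivery_state` instead of a `HashMap`.

```rust
use std::collections::HashMap;

/// (channel_id, thread_ts, session), matching the proposed primary key.
pub type ThreadKey = (String, String, String);

/// In-memory stand-in for per-thread failure tracking.
#[derive(Debug, Default)]
pub struct DeliveryTracker {
    timeouts: HashMap<ThreadKey, u32>,
}

impl DeliveryTracker {
    /// Record one timed-out event. Returns true only for the first timeout
    /// on a thread, which is the only one that should become a visible failure.
    pub fn record_timeout(&mut self, key: ThreadKey) -> bool {
        let count = self.timeouts.entry(key).or_insert(0);
        *count += 1;
        *count == 1
    }
}
```

Draining three pending timeouts for the same key would then yield exactly one `true`, which is the property the new test asserts.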

Rollback safety:

- Existing pending table can remain as compatibility storage until this PR is proven.

### PR 6: Runbook and Operator UX Polish

Scope:

- Document delivery states and healing behavior.
- Add examples for healthy, healed, and failed delivery.
- Add troubleshooting steps for stale MCP processes.
- Update wiki closeout.

Testing:

```bash
scripts/wiki-audit --strict --base origin/main
git diff --check
```

Rollback safety:

- Documentation only.

### PR 7: Context Usage Status Explanation

Scope:

- Keep `orc_agent_runtime_status` read-only and transcript-free.
- When context usage is unavailable, report the reason explicitly:
  - no visible context meter in pane capture;
  - non-Codex/non-Claude pane;
  - capture failed;
  - unsupported agent status format.
- Add a test that verifies WPM-like status lines return model/effort plus a clear unavailable reason.
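
The unavailable reasons listed above could be modeled as an enum with one user-facing explanation each. This is a sketch only: `ContextUnavailable` and `explain` are hypothetical names, and the message wording (apart from the example phrase used elsewhere in this RFC) is an assumption.

```rust
/// Why context usage could not be reported (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContextUnavailable {
    NoVisibleMeter,
    UnsupportedPane,
    CaptureFailed,
    UnsupportedFormat,
}

impl ContextUnavailable {
    /// Short explanation that names the actual source of the gap.
    pub fn explain(self) -> &'static str {
        match self {
            ContextUnavailable::NoVisibleMeter => {
                "no visible Codex context meter found in captured pane status"
            }
            ContextUnavailable::UnsupportedPane => "pane is not running Codex or Claude",
            ContextUnavailable::CaptureFailed => "pane capture failed",
            ContextUnavailable::UnsupportedFormat => "agent status format is not supported",
        }
    }
}
```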

Testing:

```bash
cargo fmt --check
cargo test --bin orc agent_runtime_status
printf '%s\n' '<MCP tools/call payload>' | timeout 5 orc mcp
```

Rollback safety:

- Output-shape additive if the existing `context_source` remains.

### PR 8: Follow-Up Settings, Cursors, and Candidate Storage

Scope:

- Add opt-in config, conversation allowlist, cursor storage, and read-only Slack history ingestion.
- Add follow-up candidate/action tables.
- Verify no raw Slack text is written to logs/wiki by default.

Testing:

```bash
cargo fmt --check
cargo test --lib bridge_store::tests
cargo test --bin orc slack_followups
```

Rollback safety:

- Additive tables and disabled-by-default config.

### PR 9: Deterministic 2-Hour Candidate Detector

Scope:

- Implement `needs_my_reply` and `awaiting_their_reply`.
- Add filters for bot messages, muted/ignored conversations, later Marvin replies, snoozed candidates, and reaction closures.
- Collapse multiple messages in one thread into one candidate.

Testing:

```bash
cargo fmt --check
cargo test --bin orc slack_followup_candidates
```

Rollback safety:

- No nudges sent yet; candidate generation can be disabled by config.

### PR 10: Personal Nudge Delivery with Three Buttons

Scope:

- Send personal DM/App Home nudges with exactly three primary buttons: `Draft reply`, `Mark done`, and `Remind later`.
- Include the Slack thread permalink in message text instead of a fourth `Open thread` button.
- Handle button payloads through the bridge.
- Record action state and expose `orc slack followups status`.

Testing:

```bash
cargo fmt --check
cargo test --bin orc slack_followup_nudges
```

Live smoke:

1. Seed a candidate fixture older than 2 hours.
2. Run the nudge drain in dry-run mode.
3. Run against Marvin-only test channel/DM.
4. Verify all three buttons mutate candidate state correctly.

Rollback safety:

- Feature flag can disable sending while preserving candidate state.

### PR 11: LLM Suppression Gate and Natural-Language Follow-Up

Scope:

- Add a minimal candidate packet and structured model output.
- Let the LLM suppress noisy deterministic candidates, but not create candidates from arbitrary workspace history.
- Support natural-language follow-up messages such as "what did I miss?", "snooze launch followups", and "draft a reply to Sarah."

Testing:

```bash
cargo fmt --check
cargo test --bin orc slack_followup_llm_gate
```

Rollback safety:

- Feature flag can bypass the LLM and use deterministic candidates only.

## Dependency Graph

```text
main
 └─ Stack A: Slack delivery self-healing
    PR 1: Delivery state storage and status plumbing
    PR 2: Session health preflight
    PR 3: Ephemeral delivery cues
    PR 4: Auto-heal, recreate, or rebind before forwarding
    PR 5: Delivery timeouts and agent signals
    PR 6: Runbook and operator UX polish
    PR 7: Context usage status explanation
 └─ Stack B: Personal Slack follow-up nudges
    PR 8: Follow-up settings, cursors, and candidate storage
    PR 9: Deterministic 2-hour candidate detector
    PR 10: Personal nudge delivery with three buttons
    PR 11: LLM suppression gate and natural-language follow-up
```

PR 3 can be developed in parallel after PR 1 if needed, but PR 4 should wait for PR 2. PR 5 should land after PR 4 so it can use the final delivery state model. PR 7 can land independently. Stack B can start today in parallel with Stack A after the storage migration approach is settled; PR 10 should wait for PR 8 and PR 9.

## Open Questions

- Should auto-healing restart an agent process that has unsent composer text, or should it fail and ask for manual confirmation?
- Should stale MCP detection be a warning only at first, with auto-restart limited to Slack-created sessions?
- Should the bridge post ephemeral "checking" immediately, or only if preflight/healing exceeds a short threshold such as 2 seconds?
- Should `eyes` from the agent mean "working" and `white_check_mark` mean "done", or should only actual replies complete delivery?
- Should delivery state be visible in `orc ls` as a compact Slack state column, or only in `orc show` and Slack `status`?
- For follow-up nudges, should `Remind later` default to 1 hour or tomorrow when the candidate is already older than 2 hours?

## Success Criteria

- A healthy bound Slack thread forwards messages to the agent with no extra public Orc chatter.
- A stale/deleted MCP process is detected before delivery.
- A repairable stale session is healed and receives the original Slack message exactly once.
- A non-repairable session first tries to create a replacement or rebind safely; if choice is needed, the user sees at most three action buttons and no command syntax.
- Ephemeral delivery cues are used only as transient UX; SQLite remains the durable source of truth.
- `status` and `orc show` clearly explain what is happening without requiring the user to know internal bridge concepts.
- Follow-up nudges are buildable today: opt-in history ingestion, deterministic 2-hour candidates, three-button nudges, and LLM suppression are scoped as concrete PRs.
- `orc_agent_runtime_status` no longer leaves users guessing when context usage is unavailable.
