Under the Hood
Your agent isn’t just Claude in a terminal. Behind every session, 54 background systems work continuously to keep things running — recovering from crashes, delivering messages reliably, syncing state across machines, and cleaning up after themselves.
None of these were designed upfront. Every system on this page exists because something actually broke in production. Sessions stalled silently, messages vanished, laptops slept and agents went brain-dead, logs filled disks, orphaned processes ate memory. Each problem showed up during real usage, got diagnosed, and got solved — then the solution became a permanent part of the platform. This isn’t speculative architecture. It’s 54 battle scars turned into armor.
This page gives you the bird’s-eye view. Scan the overview, then open any category to look inside the engine.
The Nine Categories
| Category | What It Does | Processes |
|---|---|---|
| Session Management | Catches crashes, recovers sessions, keeps you from losing work | 4 |
| Health Monitoring | Watches the agent’s own health and alerts when something degrades | 4 |
| Core Infrastructure | Updates, config hot-reload, sleep recovery, process integrity | 7 |
| Messaging | Reliable message delivery, intelligent routing, notification batching | 5 |
| Agent Network | Discovery and communication between agents (Threadline) | 6 |
| Dashboard & Streaming | Real-time terminal output and session monitoring in your browser | 3 |
| Housekeeping | Cleans up zombie sessions, rotates logs, prunes old data | 8 |
| Lifecycle | Sleep/wake recovery and graceful shutdown | 2 |
| Platform Services | Quota tracking, commitments, evolution, memory monitoring | 9 |
Session Management
The safety net. Four systems work together in layers, and each catches what the previous one misses.
SessionWatchdog
Polls every 30 seconds for stuck bash commands. If a command has been running longer than 3 minutes, it asks an LLM: “Is this legitimately long-running (like npm install) or actually stuck?” If stuck, it escalates through Ctrl+C → SIGTERM → SIGKILL, giving the session time to recover at each step. Sessions almost always survive: the nuclear option (killing the whole session) requires a process to survive both SIGTERM and SIGKILL twice.
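The escalation ladder can be sketched as a small pure function. This is a minimal illustration, not the actual implementation: the names `Step`, `LADDER`, and `nextStep` are hypothetical, the LLM verdict is passed in as a boolean, and the "survive twice" rule for killing the whole session is left out.

```typescript
// Hypothetical names; only the thresholds and the escalation order come from the text above.
type Step = "ctrl-c" | "SIGTERM" | "SIGKILL";

const LADDER: readonly Step[] = ["ctrl-c", "SIGTERM", "SIGKILL"];
const STUCK_THRESHOLD_MS = 3 * 60 * 1000; // commands running longer than 3 minutes get reviewed

/**
 * Returns the next escalation step, or null when nothing should be sent.
 * `judgedStuck` stands in for the LLM verdict, which the caller supplies.
 */
function nextStep(runtimeMs: number, judgedStuck: boolean, lastStep: Step | null): Step | null {
  if (runtimeMs <= STUCK_THRESHOLD_MS || !judgedStuck) return null;
  const nextIndex = lastStep === null ? 0 : LADDER.indexOf(lastStep) + 1;
  return LADDER[nextIndex] ?? null; // null once the ladder is exhausted
}
```

A caller would invoke `nextStep` on each 30-second poll, remember the step it last sent, and give the session time to recover before asking again.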
SessionRecovery
The fast mechanical layer. Analyzes the conversation JSONL file to detect three failure patterns:
- Tool stalls — Claude sent a tool call but never got a result back
- Crashes — Process died with an incomplete conversation
- Error loops — Same error repeated 3+ times
When detected, it truncates the conversation to a safe point and respawns. No LLM needed — pure file analysis. Handles ~60-70% of failures instantly.
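To make the "pure file analysis" idea concrete, here is a rough sketch of two of the three detectors. The entry shape is a simplified, hypothetical stand-in for the real conversation JSONL schema, and "repeated 3+ times" is assumed to mean three identical errors in a row at the end of the file. Crash detection needs process state and is omitted.

```typescript
// Hypothetical, simplified entry shape; the real conversation schema is not shown on this page.
type Entry =
  | { type: "tool_call"; id: string }
  | { type: "tool_result"; id: string }
  | { type: "error"; message: string }
  | { type: "text" };

type Failure = "tool-stall" | "error-loop" | null;

function parseJsonl(raw: string): Entry[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Entry);
}

function detectFailure(entries: Entry[]): Failure {
  // Tool stall: a tool call that never received a matching result.
  const pending = new Set<string>();
  for (const entry of entries) {
    if (entry.type === "tool_call") pending.add(entry.id);
    if (entry.type === "tool_result") pending.delete(entry.id);
  }
  if (pending.size > 0) return "tool-stall";

  // Error loop: the last three entries are the same error.
  const last = entries[entries.length - 1];
  if (last !== undefined && last.type === "error") {
    const message = last.message;
    const tail = entries.slice(-3);
    if (tail.length === 3 && tail.every((e) => e.type === "error" && e.message === message)) {
      return "error-loop";
    }
  }
  return null;
}
```

A real recovery pass would also pick the truncation point, for example the line just before the first unanswered tool call.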
TriageOrchestrator
The intelligent layer. Has 8 battle-tested heuristic patterns that resolve ~90% of remaining cases without any LLM call. For example:
- Session dead → auto-restart
- Message lost (prompt visible but message pending) → re-inject
- JSONL actively being written → wait and check back in 5 minutes
- Fatal errors (out of memory, segfault) → auto-restart
- Context exhausted (≤3% remaining) → auto-restart
Only when no heuristic matches does it spawn a scoped Claude session to diagnose the problem. Even then, deterministic safety predicates gate every auto-action — the LLM can suggest, but only verified conditions trigger automatic recovery.
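The heuristic pass can be pictured as an ordered rule table where the first match wins and a miss falls through to LLM diagnosis. The sketch below encodes only the five rules listed above. The `Snapshot` fields, the action names, and the rule ordering are all assumptions made for illustration.

```typescript
// All field and action names are hypothetical; the five rules mirror the list above.
interface Snapshot {
  processAlive: boolean;
  promptVisible: boolean;
  messagePending: boolean;
  jsonlRecentlyWritten: boolean;
  fatalErrorSeen: boolean;
  contextRemainingPct: number;
}

type Action = "restart" | "reinject-message" | "wait-5-min";

interface Heuristic {
  name: string;
  matches: (s: Snapshot) => boolean;
  action: Action;
}

// Order is an assumption: hard failures are checked before softer signals.
const HEURISTICS: Heuristic[] = [
  { name: "session-dead", matches: (s) => !s.processAlive, action: "restart" },
  { name: "fatal-error", matches: (s) => s.fatalErrorSeen, action: "restart" },
  { name: "context-exhausted", matches: (s) => s.contextRemainingPct <= 3, action: "restart" },
  { name: "message-lost", matches: (s) => s.promptVisible && s.messagePending, action: "reinject-message" },
  { name: "still-writing", matches: (s) => s.jsonlRecentlyWritten, action: "wait-5-min" },
];

/** Returns the first matching heuristic, or null when an LLM diagnosis is needed. */
function triage(snapshot: Snapshot): Heuristic | null {
  return HEURISTICS.find((h) => h.matches(snapshot)) ?? null;
}
```

Keeping each rule a deterministic predicate over observed state is what lets the same checks act as safety gates when an LLM suggests an action.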
SessionMonitor
The proactive eye. Polls every 60 seconds to classify each session as healthy, idle, unresponsive, or dead. Feeds problems into the recovery stack before users notice. Won’t spam you: one notification per issue, with a 30-minute cooldown per topic.
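The per-topic cooldown is simple to sketch. The class and method names below are hypothetical. The only detail taken from the description is the 30-minute window.

```typescript
const COOLDOWN_MS = 30 * 60 * 1000; // 30-minute cooldown per topic

// Hypothetical helper: remembers when each topic was last notified.
class NotificationCooldown {
  private lastSent = new Map<string, number>();

  /** Returns true (and records the send) only if the topic is outside its cooldown. */
  shouldNotify(topic: string, nowMs: number): boolean {
    const previous = this.lastSent.get(topic);
    if (previous !== undefined && nowMs - previous < COOLDOWN_MS) return false;
    this.lastSent.set(topic, nowMs);
    return true;
  }
}
```

Passing the clock in as `nowMs` keeps the logic testable without waiting for real time to pass.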
How they connect: SessionMonitor detects the problem → SessionRecovery tries a fast fix → if that doesn’t work, TriageOrchestrator runs heuristics → if those don’t match, it spawns an LLM diagnosis. Meanwhile, SessionWatchdog independently catches stuck commands at the process level.
🔗 Paired jobs: session-continuity-check verifies that sessions produce lasting artifacts. guardian-pulse monitors whether the recovery stack itself is functioning.
Health Monitoring
The self-awareness layer. The agent continuously checks its own health and tells you when something breaks.
CoherenceMonitor
Every 5 minutes, runs checks across 5 categories: process integrity (is the binary stale?), config coherence (does the file match what’s in memory?), state durability (are state files intact?), output sanity (is the agent producing reasonable responses?), and feature readiness (are tokens and credentials properly set?).
SystemReviewer
Every 6 hours, runs deep functional probes that ask not just “is this component alive?” but “does it actually work?” Tests session spawning, scheduler health, messaging connectivity, and platform resources. Trends results over a 10-review window to distinguish persistent failures from transient blips.
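One way to picture the trend window is a rolling list of the last 10 probe results per component. In this sketch the window size comes from the description above, while the rule for calling a failure persistent (three failures in a row, `PERSISTENT_AFTER`) is purely an assumption, as are the function names.

```typescript
const WINDOW = 10; // reviews kept per probe
const PERSISTENT_AFTER = 3; // assumption: this many consecutive failures counts as persistent

/** Appends a result and trims the history to the most recent WINDOW entries. */
function record(history: boolean[], passed: boolean): boolean[] {
  return [...history, passed].slice(-WINDOW);
}

function classify(history: boolean[]): "healthy" | "transient" | "persistent" {
  if (history.every((passed) => passed)) return "healthy";
  // Count failures at the end of the window.
  let streak = 0;
  for (let i = history.length - 1; i >= 0 && history[i] === false; i--) streak++;
  return streak >= PERSISTENT_AFTER ? "persistent" : "transient";
}
```

With a 6-hour review cadence, a 10-review window covers roughly two and a half days of history.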
StallDetector
Monitors message delivery. When a message is injected into a session and gets no response within 5 minutes, it verifies whether the session is truly stalled (not just busy), then triggers the recovery pipeline. Also performs “promise detection”: catching cases where the agent says “working on it” but never follows up.
DegradationReporter
Event-driven: fires whenever a system falls back from its primary path to a secondary one. For example, if SQLite-backed memory fails and falls back to JSONL, the reporter logs it, files a bug report, and sends you a human-readable Telegram notification. Ensures no fallback happens silently.
🔗 Paired jobs: health-check (every 5 min heartbeat), degradation-digest (groups events into patterns), state-integrity-check (validates state file consistency). The overseer-guardian reviews all monitoring jobs as a group.
Core Infrastructure
The invisible plumbing. You never think about these until they save you.
AutoUpdater
Checks for new versions every 30 minutes. When an update is found, it coalesces rapid-fire releases (waits 5 minutes for additional updates before acting), checks if there are active sessions (defers restart if so, forces after 30 minutes), and handles the restart cleanly. Can be disabled in config.
GitSyncManager
Automatic git-based state synchronization for multi-machine setups. Debounces commits (30 seconds), runs a full sync cycle every 30 minutes, and has multi-stage conflict resolution: programmatic merging for simple cases, LLM-powered resolution for complex ones, human escalation as a last resort.
🔗 Paired job: git-sync runs hourly full reconciliation with tiered model escalation.
LiveConfig
Watches config.json every 5 seconds for changes. When a value changes, it emits events so other systems can hot-reload without a server restart.
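The change-detection step might look like the following sketch, which compares two parsed snapshots of the config and reports which top-level keys changed. `diffConfig` and `ConfigChange` are hypothetical names, and comparing values through `JSON.stringify` is a simplification that treats key order inside nested objects as significant.

```typescript
type Config = Record<string, unknown>;

interface ConfigChange {
  key: string;
  before: unknown;
  after: unknown;
}

/** Lists every top-level key whose value differs between two config snapshots. */
function diffConfig(before: Config, after: Config): ConfigChange[] {
  const keys = Array.from(new Set(Object.keys(before).concat(Object.keys(after))));
  const changes: ConfigChange[] = [];
  for (const key of keys) {
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes.push({ key, before: before[key], after: after[key] });
    }
  }
  return changes;
}
```

A watcher would re-read the file on each 5-second tick, call `diffConfig` against the in-memory copy, and emit one event per returned change.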
SleepWakeDetector
Ticks every 2 seconds. If the gap between ticks exceeds 10 seconds, your machine slept. On wake, it fires an event that triggers: tunnel reconnection, Telegram re-polling, session health re-checks, and heartbeat resumption. Without this, opening your laptop would leave the agent looking online but actually broken.
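The tick-gap trick works because timers stop firing while the machine is asleep, so the first tick after wake arrives far later than scheduled. Below is a minimal sketch with a hypothetical class name, where the clock is injected so the logic can be exercised without real sleeping.

```typescript
const SLEEP_GAP_MS = 10_000; // a gap longer than this between ticks means the machine slept

// Hypothetical sketch: the caller drives tick() from a 2-second timer.
class TickGapDetector {
  private lastTick: number;
  private onWake: (sleptMs: number) => void;

  constructor(startMs: number, onWake: (sleptMs: number) => void) {
    this.lastTick = startMs;
    this.onWake = onWake;
  }

  tick(nowMs: number): void {
    const gap = nowMs - this.lastTick;
    this.lastTick = nowMs;
    if (gap > SLEEP_GAP_MS) this.onWake(gap);
  }
}

// In a running process this would be wired up roughly as:
//   const detector = new TickGapDetector(Date.now(), (ms) => reconnectEverything(ms));
//   setInterval(() => detector.tick(Date.now()), 2_000);
// where reconnectEverything is a placeholder for the wake handlers listed above.
```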
CaffeinateManager
macOS only. Runs caffeinate -s to prevent your Mac from sleeping while the agent is running. Monitors the process every 30 seconds and restarts it if it dies.
ProcessIntegrity
Freezes the running version at startup and compares it to what’s on disk. Detects when npm install -g updated the binary but the running process still has old code in memory.
ForegroundRestartWatcher
When running without a supervisor, watches for restart signals (written by AutoUpdater after an update). Notifies you, waits 3 seconds for graceful shutdown, then exits so the process manager can restart with the new code.
Messaging
Reliable delivery with intelligent routing. Messages don’t get lost, and they go to the right session.
SessionSummarySentinel
Every 60 seconds, captures terminal output from each active session and generates a structured summary via Haiku. Uses hash-based change detection to skip sessions with no new output. These summaries enable intelligent message routing: when you send a message marked “send to best session,” the system scores each session’s relevance and picks the right one.
SessionActivitySentinel
Every 30 minutes, creates condensed digests of what each session accomplished. Splits activity into meaningful chunks, summarizes each via LLM, and stores them in episodic memory. When a session completes, generates a full synthesis. This is how the agent builds long-term memory of what it’s done.
🔗 Paired job: session-continuity-check reviews whether sessions actually produced lasting artifacts.
NotificationBatcher
Three tiers of notification urgency:
- Immediate — quota exhaustion, critical stalls (sent instantly)
- Summary — job completions, session lifecycle (batched every 30 minutes)
- Digest — routine system notices (batched every 2 hours)
Uses state-change-only deduplication: repeated identical notifications are suppressed until the content actually changes. Supports quiet hours (demotes Summary → Digest during configured times).
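The two rules in this section, quiet-hours demotion and state-change-only deduplication, can be sketched independently. All names here are hypothetical. The dedup key is assumed to be whatever identifies "the same notification", such as a job or session ID.

```typescript
type Tier = "immediate" | "summary" | "digest";

/** During quiet hours, Summary notifications are demoted to Digest. Other tiers are unchanged. */
function effectiveTier(tier: Tier, inQuietHours: boolean): Tier {
  return tier === "summary" && inQuietHours ? "digest" : tier;
}

// Suppresses a notification when its content matches the last accepted content for the same key.
class StateChangeDeduper {
  private lastContent = new Map<string, string>();

  accept(key: string, content: string): boolean {
    if (this.lastContent.get(key) === content) return false;
    this.lastContent.set(key, content);
    return true;
  }
}
```

Deduplicating on content rather than on a time window means a status that flips and flips back still produces a notification for each change.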
DeliveryRetryManager
Three layers of retry for inter-agent messages:
- Layer 1: Server unreachable (exponential backoff, up to 4 hours)
- Layer 2: Session unavailable (30-second intervals, up to 5 minutes)
- Layer 3: ACK timeout (escalates unacknowledged messages)
Plus a post-injection watchdog: 10 seconds after delivering a message, checks if the session is still alive. If it crashed during injection, the message goes back to the retry queue.
🔗 Paired job: feedback-retry handles feedback-specific retries on a 6-hour cycle.
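The first two retry layers can be sketched as delay schedules. Several details here are assumptions: the 60-second base delay, the doubling factor, and reading "up to 4 hours" as a total time budget rather than a cap on a single delay. The function names are hypothetical.

```typescript
/** Doubling delays starting at baseMs, stopping before the total wait would exceed budgetMs. */
function backoffSchedule(baseMs: number, budgetMs: number): number[] {
  if (baseMs <= 0) throw new Error("baseMs must be positive");
  const delays: number[] = [];
  let delay = baseMs;
  let elapsed = 0;
  while (elapsed + delay <= budgetMs) {
    delays.push(delay);
    elapsed += delay;
    delay *= 2;
  }
  return delays;
}

/** Fixed-interval retries that fit inside budgetMs. */
function fixedSchedule(intervalMs: number, budgetMs: number): number[] {
  if (intervalMs <= 0) throw new Error("intervalMs must be positive");
  const count = Math.floor(budgetMs / intervalMs);
  return new Array<number>(count).fill(intervalMs);
}

// Layer 1 (assumed 60 s base, 4 h budget) and Layer 2 (30 s intervals, 5 min budget).
const layer1 = backoffSchedule(60_000, 4 * 60 * 60 * 1000);
const layer2 = fixedSchedule(30_000, 5 * 60 * 1000);
```

Under these assumptions Layer 1 makes 7 attempts and Layer 2 makes 10 before the message is handed to the next stage.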
MessageStore
File-based message persistence. Atomic writes (temp file + rename for crash safety), deduplication, dead-letter archiving for failed messages (30-day retention), and JSONL indexes for fast queries.
Agent Network
Inter-agent communication. Optional: only activates when Threadline is enabled.
AgentDiscovery
5-second heartbeat. Announces this agent’s presence in the shared registry, discovers other agents on the same machine.
HandshakeManager
Ed25519 identity key management for end-to-end encrypted communication between agents.
TrustManager
Maintains trust levels for known agents: untrusted → verified → trusted → autonomous. Determines what actions other agents can take.
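Because the four levels form a strict ladder, a permission check reduces to comparing positions. The sketch below takes the level names from the description above. The `canPerform` helper and the idea of a per-action minimum level are assumptions, and the policy is passed in by the caller since this page does not say which actions require which level.

```typescript
const TRUST_LEVELS = ["untrusted", "verified", "trusted", "autonomous"] as const;
type TrustLevel = (typeof TRUST_LEVELS)[number];

/** True when `actual` is at or above `required` on the trust ladder. */
function meetsTrust(actual: TrustLevel, required: TrustLevel): boolean {
  return TRUST_LEVELS.indexOf(actual) >= TRUST_LEVELS.indexOf(required);
}

/** Hypothetical policy check: actions missing from the policy are denied. */
function canPerform(
  agentLevel: TrustLevel,
  action: string,
  policy: Record<string, TrustLevel>,
): boolean {
  const required = policy[action];
  return required !== undefined && meetsTrust(agentLevel, required);
}
```

Denying unknown actions by default keeps a missing policy entry from silently granting access.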
ThreadlineRouter
Routes messages between agents via the Threadline protocol. Handles trust verification, payload validation, and delivery.
InboundMessageGate
Validates incoming relay messages against trust levels. Blocks oversized payloads (>64KB).
Relay Client
WebSocket connection to the cloud relay for cross-machine agent communication.
Dashboard & Streaming
Real-time visibility into what your agent is doing.
WebSocketManager
Manages dashboard connections. Handles authentication, client subscriptions, and message routing between the browser and the server.
Terminal Stream
Captures terminal output from subscribed sessions every 500ms, computes diffs, and sends only changed content to connected dashboard clients. Efficient: no captures happen when nobody is watching.
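A rough sketch of the two efficiency rules follows: skip the capture when there are no subscribers, and send only what changed. The class name and the naive line-by-line diff are assumptions, and a real diff would also handle output that shrinks or scrolls.

```typescript
interface LineChange {
  line: number;
  text: string;
}

/** Naive positional diff: reports every line whose text differs from the previous capture. */
function changedLines(previous: string, current: string): LineChange[] {
  const before = previous.split("\n");
  const changes: LineChange[] = [];
  current.split("\n").forEach((text, line) => {
    if (before[line] !== text) changes.push({ line, text });
  });
  return changes;
}

class TerminalStreamSketch {
  private lastCapture = new Map<string, string>();
  private watchers = new Map<string, number>();

  subscribe(sessionId: string): void {
    this.watchers.set(sessionId, (this.watchers.get(sessionId) ?? 0) + 1);
  }

  unsubscribe(sessionId: string): void {
    this.watchers.set(sessionId, Math.max(0, (this.watchers.get(sessionId) ?? 0) - 1));
  }

  /** Called on each 500 ms tick. `capture` is only invoked when someone is watching. */
  poll(sessionId: string, capture: () => string): LineChange[] {
    if ((this.watchers.get(sessionId) ?? 0) === 0) return [];
    const current = capture();
    const changes = changedLines(this.lastCapture.get(sessionId) ?? "", current);
    this.lastCapture.set(sessionId, current);
    return changes;
  }
}
```

Passing the capture step in as a callback makes the "no captures when nobody is watching" property easy to verify.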
Session List Broadcast
Sends the running session list to all connected clients every 5 seconds. Includes session metadata, display names, and telemetry (tool usage, subagent activity).
Housekeeping
Keeps things clean. Without these, logs grow forever and zombie processes accumulate.
OrphanProcessReaper
Every 60 seconds, detects orphaned Claude processes that aren’t tracked by the session manager. Classifies them (managed vs orphaned vs external IDE processes), auto-kills orphans after 1 hour, and reports external processes to you.
JSONL Rotation
Lazy, size-based rotation built into all append-only log files. When a file exceeds 10MB, it keeps the newest 75% and atomically replaces the file. Non-fatal: a rotation failure doesn’t block writes.
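The trimming rule can be shown as a pure function over the file contents. This sketch assumes that "newest 75%" is measured in lines rather than bytes and approximates file size with string length, which is only accurate for ASCII logs. The function name is hypothetical.

```typescript
const MAX_BYTES = 10 * 1024 * 1024; // rotate once the file exceeds 10MB
const KEEP_FRACTION = 0.75; // keep the newest 75% of entries

/** Returns the content unchanged when under the limit, otherwise only the newest entries. */
function rotateJsonl(content: string, maxBytes: number = MAX_BYTES): string {
  if (content.length <= maxBytes) return content;
  const lines = content.split("\n").filter((line) => line !== "");
  const keep = Math.ceil(lines.length * KEEP_FRACTION);
  return lines.slice(lines.length - keep).join("\n") + "\n";
}
```

An on-disk version would write the trimmed content to a temporary file and rename it over the original, and would catch any error so that a failed rotation never blocks the append that triggered it.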
Session File Cleanup
Removes session state files for completed sessions (after 24 hours) and killed sessions (after 1 hour).
Triage Evidence Cleanup
Every 6 hours, removes stale triage evidence files and cleans up abandoned triage sessions.
Recovery Backup Cleanup
Every 6 hours, removes .bak files older than 24 hours that were created during conversation JSONL truncation.
Dead-Letter Cleanup
Every 6 hours, removes failed messages from the dead-letter queue that are older than 30 days.
Temp File Cleanup
On server startup, removes temporary Telegram files older than 7 days.
Global Install Cleanup
On server startup, removes stale global instar installations.
Lifecycle
Handles the transitions: starting up, shutting down, and everything in between.
SleepWakeDetector
Described under Core Infrastructure. Detects when your machine sleeps and triggers recovery on wake.
Graceful Shutdown
Signal handlers (SIGTERM/SIGINT) that ensure clean shutdown: stops all polling, persists state, disconnects messaging, closes WebSocket connections, kills the caffeinate process, and unregisters from the agent registry.
Platform Services
The higher-level systems that give the agent capabilities beyond just running code.
QuotaTracker
Monitors Claude API token usage in real-time. Sends Telegram warnings when approaching limits, enforces quotas to prevent runaway sessions, and can auto-switch between accounts if configured.
CommitmentTracker
When you tell your agent to change a setting (“always use Haiku for jobs”), this system watches for config changes that revert your instruction and alerts you if one occurs.
EvolutionManager
The self-improvement loop. Detects gaps in the agent’s capabilities, generates improvement proposals, and implements approved changes. Runs the full pipeline: gap detection → proposal → review → implementation.
🔗 Paired jobs: The entire learning pipeline — reflection-trigger, insight-harvest, evolution-proposal-evaluate, and evolution-proposal-implement. The process manages the lifecycle; the jobs do the thinking.
AgentRegistry Heartbeat
Every 30 seconds, writes a heartbeat to the global agent registry so other agents and tools can discover this agent.
TopicResumeMap
Every 60 seconds, updates the mapping between Telegram topics and session UUIDs. When a session dies and respawns, this mapping ensures the new session can resume with full conversation context via --resume.
CommitmentSentinel
Scans Telegram messages every 5 minutes to detect promises the agent made (“I’ll deploy on Friday”) that weren’t formally registered.
🔗 Replaced by job: commitment-detection — same intelligence as a scheduled job, easier to tune and monitor.
MemoryMonitor
Tracks heap memory usage. Triggers orphan cleanup when memory exceeds 80% of available capacity.
🔗 Paired jobs: memory-hygiene (reviews MEMORY.md quality) and memory-export (regenerates MEMORY.md from the knowledge graph).
WorktreeMonitor
Monitors git worktrees created for isolated agent work. Detects stale branches, reaps orphaned worktrees.
HealthChecker
Legacy health probe system, superseded by SystemReviewer’s more comprehensive tiered probe architecture.
🔗 Paired job: health-check runs a lightweight heartbeat every 5 minutes.
See Also
- Default Jobs: The 26 scheduled jobs that run alongside these processes
- The Living System: How processes and jobs work together as a unified organism
- Self-Healing: The user-facing perspective on recovery