Skip to content

Under the Hood

Your agent isn’t just Claude in a terminal. Behind every session, 54 background systems work continuously to keep things running — recovering from crashes, delivering messages reliably, syncing state across machines, and cleaning up after themselves.

None of these were designed upfront. Every system on this page exists because something actually broke in production. Sessions stalled silently, messages vanished, laptops slept and agents went brain-dead, logs filled disks, orphaned processes ate memory. Each problem showed up during real usage, got diagnosed, and got solved — then the solution became a permanent part of the platform. This isn’t speculative architecture. It’s 54 battle scars turned into armor.

This page gives you the bird’s-eye view. Scan the overview, then open any category to look inside the engine.

CategoryWhat It DoesProcesses
Session ManagementCatches crashes, recovers sessions, keeps you from losing work4
Health MonitoringWatches the agent’s own health and alerts when something degrades4
Core InfrastructureUpdates, config hot-reload, sleep recovery, process integrity7
MessagingReliable message delivery, intelligent routing, notification batching5
Agent NetworkDiscovery and communication between agents (Threadline)6
Dashboard & StreamingReal-time terminal output and session monitoring in your browser3
HousekeepingCleans up zombie sessions, rotates logs, prunes old data8
LifecycleSleep/wake recovery and graceful shutdown2
Platform ServicesQuota tracking, commitments, evolution, memory monitoring9

The safety net. Four systems work together in layers — each catches what the previous one misses.

See the 4-layer recovery stack

Polls every 30 seconds for stuck bash commands. If a command has been running longer than 3 minutes, it asks an LLM: “Is this legitimately long-running (like npm install) or actually stuck?” If stuck, it escalates through Ctrl+C → SIGTERM → SIGKILL, giving the session time to recover at each step. Sessions almost always survive — the nuclear option (killing the whole session) requires a process to survive both SIGTERM and SIGKILL twice.

The fast mechanical layer. Analyzes the conversation JSONL file to detect three failure patterns:

  • Tool stalls — Claude sent a tool call but never got a result back
  • Crashes — Process died with an incomplete conversation
  • Error loops — Same error repeated 3+ times

When detected, it truncates the conversation to a safe point and respawns. No LLM needed — pure file analysis. Handles ~60-70% of failures instantly.

The intelligent layer. Has 8 battle-tested heuristic patterns that resolve ~90% of remaining cases without any LLM call:

  • Session dead → auto-restart
  • Message lost (prompt visible but message pending) → re-inject
  • JSONL actively being written → wait and check back in 5 minutes
  • Fatal errors (out of memory, segfault) → auto-restart
  • Context exhausted (≤3% remaining) → auto-restart

Only when no heuristic matches does it spawn a scoped Claude session to diagnose the problem. Even then, deterministic safety predicates gate every auto-action — the LLM can suggest, but only verified conditions trigger automatic recovery.

The proactive eye. Polls every 60 seconds to classify each session as healthy, idle, unresponsive, or dead. Feeds problems into the recovery stack before users notice. Won’t spam you — one notification per issue, with a 30-minute cooldown per topic.

How they connect: SessionMonitor detects the problem → SessionRecovery tries a fast fix → if that doesn’t work, TriageOrchestrator runs heuristics → if those don’t match, it spawns an LLM diagnosis. Meanwhile, SessionWatchdog independently catches stuck commands at the process level.

🔗 Paired jobs: session-continuity-check verifies that sessions produce lasting artifacts. guardian-pulse monitors whether the recovery stack itself is functioning.


The self-awareness layer. The agent continuously checks its own health and tells you when something breaks.

See the 4 monitoring systems

Every 5 minutes, runs checks across 5 categories: process integrity (is the binary stale?), config coherence (does the file match what’s in memory?), state durability (are state files intact?), output sanity (is the agent producing reasonable responses?), and feature readiness (are tokens and credentials properly set?).

Every 6 hours, runs deep functional probes — not just “is this component alive?” but “does it actually work?” Tests session spawning, scheduler health, messaging connectivity, and platform resources. Trends results over a 10-review window to detect persistent failures vs transient blips.

Monitors message delivery. When a message is injected into a session and gets no response within 5 minutes, it verifies whether the session is truly stalled (not just busy), then triggers the recovery pipeline. Also tracks “promise detection” — when the agent says “working on it” but never follows up.

Event-driven — fires whenever a system falls back from its primary path to a secondary one. For example, if SQLite-backed memory fails and falls back to JSONL, the reporter logs it, files a bug report, and sends you a human-readable Telegram notification. Ensures no fallback happens silently.

🔗 Paired jobs: health-check (every 5 min heartbeat), degradation-digest (groups events into patterns), state-integrity-check (validates state file consistency). The overseer-guardian reviews all monitoring jobs as a group.


The invisible plumbing. You never think about these until they save you.

See the 7 infrastructure systems

Checks for new versions every 30 minutes. When an update is found, it coalesces rapid-fire releases (waits 5 minutes for additional updates before acting), checks if there are active sessions (defers restart if so, forces after 30 minutes), and handles the restart cleanly. Can be disabled in config.

Automatic git-based state synchronization for multi-machine setups. Debounces commits (30 seconds), runs a full sync cycle every 30 minutes, and has multi-stage conflict resolution: programmatic merging for simple cases, LLM-powered resolution for complex ones, human escalation as a last resort.

🔗 Paired job: git-sync runs hourly full reconciliation with tiered model escalation.

Watches config.json every 5 seconds for changes. When a value changes, it emits events so other systems can hot-reload without a server restart.

Ticks every 2 seconds. If the gap between ticks exceeds 10 seconds, your machine slept. On wake, it fires an event that triggers: tunnel reconnection, Telegram re-polling, session health re-checks, and heartbeat resumption. Without this, opening your laptop would leave the agent looking online but actually broken.

macOS only. Runs caffeinate -s to prevent your Mac from sleeping while the agent is running. Monitors the process every 30 seconds and restarts it if it dies.

Freezes the running version at startup and compares it to what’s on disk. Detects when npm install -g updated the binary but the running process still has old code in memory.

When running without a supervisor, watches for restart signals (written by AutoUpdater after an update). Notifies you, waits 3 seconds for graceful shutdown, then exits so the process manager can restart with the new code.


Reliable delivery with intelligent routing. Messages don’t get lost, and they go to the right session.

See the 5 messaging systems

Every 60 seconds, captures terminal output from each active session and generates a structured summary via Haiku. Uses hash-based change detection to skip sessions with no new output. These summaries enable intelligent message routing — when you send a message marked “send to best session,” the system scores each session’s relevance and picks the right one.

Every 30 minutes, creates condensed digests of what each session accomplished. Splits activity into meaningful chunks, summarizes each via LLM, and stores them in episodic memory. When a session completes, generates a full synthesis. This is how the agent builds long-term memory of what it’s done.

🔗 Paired job: session-continuity-check reviews whether sessions actually produced lasting artifacts.

Three tiers of notification urgency:

  • Immediate — quota exhaustion, critical stalls (sent instantly)
  • Summary — job completions, session lifecycle (batched every 30 minutes)
  • Digest — routine system notices (batched every 2 hours)

Uses state-change-only deduplication: repeated identical notifications are suppressed until the content actually changes. Supports quiet hours (demotes Summary → Digest during configured times).

🔗 Paired job: feedback-retry handles feedback-specific retries on a 6-hour cycle.

Three layers of retry for inter-agent messages:

  • Layer 1 — Server unreachable (exponential backoff, up to 4 hours)
  • Layer 2 — Session unavailable (30-second intervals, up to 5 minutes)
  • Layer 3 — ACK timeout (escalates unacknowledged messages)

Plus a post-injection watchdog: 10 seconds after delivering a message, checks if the session is still alive. If it crashed during injection, the message goes back to the retry queue.

File-based message persistence. Atomic writes (temp file + rename for crash safety), deduplication, dead-letter archiving for failed messages (30-day retention), and JSONL indexes for fast queries.


Inter-agent communication. Optional — only activates when Threadline is enabled.

See the agent network systems

5-second heartbeat. Announces this agent’s presence in the shared registry, discovers other agents on the same machine.

Ed25519 identity key management for end-to-end encrypted communication between agents.

Maintains trust levels for known agents: untrusted → verified → trusted → autonomous. Determines what actions other agents can take.

Routes messages between agents via the Threadline protocol. Handles trust verification, payload validation, and delivery.

Validates incoming relay messages against trust levels. Blocks oversized payloads (>64KB).

WebSocket connection to the cloud relay for cross-machine agent communication.


Real-time visibility into what your agent is doing.

See the 3 dashboard systems

Manages dashboard connections. Handles authentication, client subscriptions, and message routing between the browser and the server.

Captures terminal output from subscribed sessions every 500ms, computes diffs, and sends only changed content to connected dashboard clients. Efficient — no captures happen when nobody is watching.

Sends the running session list to all connected clients every 5 seconds. Includes session metadata, display names, and telemetry (tool usage, subagent activity).


Keeps things clean. Without these, logs grow forever and zombie processes accumulate.

See the 8 housekeeping systems

Every 60 seconds, detects orphaned Claude processes that aren’t tracked by the session manager. Classifies them (managed vs orphaned vs external IDE processes), auto-kills orphans after 1 hour, and reports external processes to you.

Lazy, size-based rotation built into all append-only log files. When a file exceeds 10MB, it keeps the newest 75% and atomically replaces the file. Non-fatal — rotation failure doesn’t block writes.

Removes session state files for completed sessions (after 24 hours) and killed sessions (after 1 hour).

Every 6 hours, removes stale triage evidence files and cleans up abandoned triage sessions.

Every 6 hours, removes .bak files created during conversation JSONL truncation that are older than 24 hours.

Every 6 hours, removes failed messages from the dead-letter queue that are older than 30 days.

On server startup, removes temporary Telegram files older than 7 days.

On server startup, removes stale global instar installations.


Handles the transitions — starting up, shutting down, and everything in between.

See the 2 lifecycle systems

Described in Core Infrastructure — detects when your machine sleeps and triggers recovery on wake.

Signal handlers (SIGTERM/SIGINT) that ensure clean shutdown: stops all polling, persists state, disconnects messaging, closes WebSocket connections, kills the caffeinate process, and unregisters from the agent registry.


The higher-level systems that give the agent capabilities beyond just running code.

See the 9 platform services

Monitors Claude API token usage in real-time. Sends Telegram warnings when approaching limits, enforces quotas to prevent runaway sessions, and can auto-switch between accounts if configured.

When you tell your agent to change a setting (“always use Haiku for jobs”), this system watches for config changes that revert your instruction and alerts you if it happens.

The self-improvement loop. Detects gaps in the agent’s capabilities, generates improvement proposals, and implements approved changes. Runs the full pipeline: gap detection → proposal → review → implementation.

🔗 Paired jobs: The entire learning pipelinereflection-trigger, insight-harvest, evolution-proposal-evaluate, and evolution-proposal-implement. The process manages the lifecycle; the jobs do the thinking.

Every 30 seconds, writes a heartbeat to the global agent registry so other agents and tools can discover this agent.

Every 60 seconds, updates the mapping between Telegram topics and session UUIDs. When a session dies and respawns, this mapping ensures the new session can resume with full conversation context via --resume.

Scans Telegram messages every 5 minutes to detect promises the agent made (“I’ll deploy on Friday”) that weren’t formally registered.

🔗 Replaced by job: commitment-detection — same intelligence as a scheduled job, easier to tune and monitor.

Tracks heap memory usage. Triggers orphan cleanup when memory exceeds 80% of available capacity.

🔗 Paired jobs: memory-hygiene (reviews MEMORY.md quality) and memory-export (regenerates MEMORY.md from the knowledge graph).

Monitors git worktrees created for isolated agent work. Detects stale branches, reaps orphaned worktrees.

Legacy health probe system — superseded by SystemReviewer’s more comprehensive tiered probe architecture.

🔗 Paired job: health-check runs a lightweight heartbeat every 5 minutes.


  • Default Jobs — The 26 scheduled jobs that run alongside these processes
  • The Living System — How processes and jobs work together as a unified organism
  • Self-Healing — The user-facing perspective on recovery