A2A Delivery Health
Agent-to-agent messaging used to be fire-and-forget at every hop: a threadline_send that
reported success only meant the transport accepted the bytes — it said nothing about whether
the peer ever processed the message. A message could vanish with both sides assuming it was
delivered. A2A Delivery Health closes that gap.
What it does
Section titled “What it does”Every outbound agent-to-agent message gets a durable delivery record with a lifecycle:
awaiting-ack → acked → escalated | failed. A reply on the conversation thread counts as the
acknowledgement — so it works with any peer today, with no protocol upgrade required (a normal
reply is the proof of receipt). Each accepted inbound message also bumps the peer’s
inbound-liveness clock.
This turns “is my channel to a peer alive?” into a read instead of a guess: when you last reached them, when they last confirmed, when you last heard from them, how many messages are still pending, and whether the channel has gone stale.
How it works
Section titled “How it works”The durable spine is threadline/A2ADeliveryTracker — a per-agent SQLite store (mirroring the
proven MessageProcessingLedger substrate) that records every send (recordSent), every
processed-acknowledgement (recordAck / the thread-keyed recordAckByThread), and every accepted
inbound message (recordInboundFrom). It exposes peerHealth for the read surface and
findOverdue as the work-list for the follow-on redelivery + escalation layer. The tracker is
recording-only — it never gates or alters a send, so it physically cannot break delivery; if
it failed, messages would send exactly as before.
The wiring records on both the cross-machine relay-ingest path (where the peer’s full routing fingerprint is in scope) and the same-machine relay-agent path, and the implicit ack is keyed on the conversation thread so it is robust to the peer identifying itself differently across transports.
Read-only observability under the local /threadline/ surface:
GET /threadline/peers/health— health for every peer:peerFp,peerName,lastSentAt,lastAckedAt,lastInboundAt,pendingCount,oldestPendingAgeMs,stale, plus a top-levelstaleCount.GET /threadline/peers/:fp/health— the same record for a single peer fingerprint.
A non-zero staleCount (or stale: true for a peer) means a message has been awaiting
acknowledgement past the threshold — the peer may be dark or unreachable, so check the relay and
the peer’s address before assuming they are simply ignoring you.
Active recovery: the redelivery sentinel
Section titled “Active recovery: the redelivery sentinel”Built on the tracker’s findOverdue work-list, monitoring/A2ARedeliverySentinel is the
active-recovery layer. On a cadence it sweeps overdue (unacknowledged) messages and, for each one
still under the attempt cap, re-sends it with exponential backoff — recovering the original body
from the canonical outbox by message id. Once a message exhausts its attempts, the sentinel raises
one aggregated attention item per dark peer (never one per message), so a peer that has gone
offline or unreachable surfaces to the operator instead of failing silently.
It is recording-and-sending only — it holds no blocking authority and never alters a live send.
Ships off by default (it re-sends and escalates); enable via
monitoring.a2aRedelivery.enabled in .instar/config.json. The escalate-once guarantee is
structural: a message marked escalated leaves the findOverdue work-list, so it is never
re-escalated.