Failure Protocols
We do not ask if a system will fail. We ask when. These are the standard operating procedures for the four most common failure modes.
Input Failures
Risk: High
Missing fields, malformed JSON, or duplicates.
PROTOCOL: Validate at entry -> Reject Invalid Payload -> Log Error. Bad data never enters the logic stream.
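A minimal sketch of the entry gate described above, in Python. The required fields (`id`, `email`), the logger name `intake`, the `InvalidPayload` exception, and the in-memory `seen_ids` set are all assumptions for illustration; a real intake would use its own schema and a persistent duplicate check.

```python
import json
import logging

logger = logging.getLogger("intake")  # hypothetical logger name

REQUIRED_FIELDS = ("id", "email")  # hypothetical schema, not from the source


class InvalidPayload(ValueError):
    """Raised when a payload is rejected at entry."""


def _reject(reason: str) -> None:
    # Log first, then reject, so every rejection leaves a trace.
    logger.error("payload rejected: %s", reason)
    raise InvalidPayload(reason)


def validate_payload(raw: str, seen_ids: set) -> dict:
    """Return the parsed payload, or raise InvalidPayload after logging why."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        _reject("malformed JSON")
    if not isinstance(payload, dict):
        _reject("payload must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        _reject(f"missing fields: {missing}")
    if not isinstance(payload["id"], (str, int)):
        _reject("id must be a string or integer")
    if payload["id"] in seen_ids:
        _reject(f"duplicate id: {payload['id']}")
    # Only a payload that passed every check is recorded and returned.
    seen_ids.add(payload["id"])
    return payload
```

Downstream logic calls `validate_payload` once at the boundary and works only with its return value, so a rejected payload never reaches it.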
External API Failures
Risk: Medium
Rate limits (429), Timeouts (504), or Auth Failures (401).
PROTOCOL: Exponential Backoff Retry (x3) -> Circuit Breaker -> Human Alert. We assume APIs are unreliable.
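The chain above could be sketched as follows. This is an illustration under stated assumptions, not a production client: "x3" is read as three retries after the first attempt, the base delay of 0.5 seconds and the threshold of two failed calls are arbitrary, and `alert` stands in for whatever pages a human.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=2, alert=print):
        self.failure_threshold = failure_threshold  # assumed value
        self.alert = alert  # hypothetical hook for the human alert
        self.consecutive_failures = 0
        self.open = False

    def call(self, func, retries=3, base_delay=0.5, sleep=time.sleep):
        """Run func with exponential backoff; open the circuit on repeated failure."""
        if self.open:
            raise CircuitOpenError("circuit open; call skipped")
        last_exc = None
        for attempt in range(retries + 1):
            try:
                result = func()
            except Exception as exc:
                last_exc = exc
                if attempt < retries:
                    # Delays grow as base_delay * 1, 2, 4, ...
                    sleep(base_delay * 2 ** attempt)
            else:
                self.consecutive_failures = 0
                return result
        # Every attempt failed: count it, and trip the breaker at the threshold.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.open = True
            self.alert(f"circuit opened after repeated failures: {last_exc!r}")
        raise last_exc
```

The sketch never closes the circuit on its own. A fuller version would add a cool-down and a trial request before resuming traffic, and would skip retries for errors that retrying cannot fix.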
Silent Failures
Risk: Critical
The workflow stops without crashing, often due to a 'ghost' logic path.
PROTOCOL: Heartbeat Monitors + 'Inactivity Threshold' Alerts. If a scheduled job doesn't run, we know.
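One way to picture a heartbeat monitor with an inactivity threshold is the sketch below. The class name, the job names, and the injectable `clock` are assumptions for illustration; in practice the check would run in a separate process so that it survives the job it watches.

```python
import time


class HeartbeatMonitor:
    def __init__(self, inactivity_threshold_s, clock=time.monotonic):
        self.threshold = inactivity_threshold_s
        self.clock = clock  # injectable so the monitor can be tested
        self.last_beat = {}

    def beat(self, job):
        """Called by a job each time it runs; the first call registers it."""
        self.last_beat[job] = self.clock()

    def stale_jobs(self):
        """Return the jobs that have been silent longer than the threshold."""
        now = self.clock()
        return sorted(
            job for job, seen in self.last_beat.items()
            if now - seen > self.threshold
        )
```

A watcher would call `stale_jobs()` on a timer and raise an alert for every name it returns. The absence of a heartbeat is the signal, which is what makes a silent stop visible.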
Data Integrity
Risk: High
Race conditions or duplicate writes creating 'zombie' records.
PROTOCOL: Idempotency Keys -> Read-After-Write Verification. We check state before mutating it, then confirm the write landed.
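A minimal sketch of the two steps, assuming an in-memory store guarded by a lock. `IdempotentStore` and the key format are hypothetical; against a real database the check and the write would share one transaction or rely on a unique constraint, and the read-back would query the datastore itself.

```python
import threading


class IdempotentStore:
    def __init__(self):
        self._records = {}
        self._lock = threading.Lock()  # serializes check-then-write

    def write(self, idempotency_key, record):
        """Return (stored_record, created). A repeated key never writes twice."""
        with self._lock:
            # Check state before mutating: has this key been handled already?
            existing = self._records.get(idempotency_key)
            if existing is not None:
                return existing, False
            self._records[idempotency_key] = dict(record)
            # Read-after-write: confirm the stored value matches what we sent.
            stored = self._records.get(idempotency_key)
            if stored != record:
                raise RuntimeError("read-after-write verification failed")
            return stored, True
```

A caller that retries after a timeout sends the same key again and gets the original record back with `created` set to `False`, so the retry cannot create a second record.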
