The assumption baked into most distributed systems advice doesn’t hold at the edge — and the cost of finding out the hard way is high. There’s a piece of conventional wisdom that gets passed around...
The assumption baked into most distributed systems advice doesn’t hold at the edge — and the cost of finding out the hard way is high. There’s a piece of conventional wisdom that gets passed around in backend engineering circles like it’s settled science: If a request fails, retry it. It’s good advice — for systems where the failure mode is “the service was temporarily unavailable.” It’s dangerous advice for systems where the failure mode is “the network doesn’t exist right now and won’t for… unknown hours.” The latter is the reality for edge systems. The retry model is built on a hidden assumption When you write retry logic, you’re implicitly betting that the failure is temporary (and not recurrent or even chronic) and that the infrastructure on the other side is still there, waiting. In a data center, that’s a fair bet. Services restart, load balancers reroute, and the upstream is almost always reachable within seconds. At the edge, the betting odds are against you. Think of drilling