Failure in Reliability
In Short

Curious about outages, distributed systems, DevOps, backend, and SRE-related stuff.
We are moving very quickly in the tech space. Issues that were bearable a few years ago either no longer exist or, where they still do, mark the system as a weak one. In a distributed system, every feature we add brings more loopholes and more tail latency (which today's Gen Z users will not put up with). Here are some points to ponder about maximizing uptime and serving the user's request cycle as well as possible.
One of the silent worms is "request failure", which can cause data loss even in a highly monitored and controlled system. The Google SRE book writes about this beautifully in "The Global Chubby Planned Outage". It says that a system can become so reliable that people use it without a second thought that it may fail at some point. A planned, synthesized failure can reveal small dependencies that could cause failures in the future, so they can be fixed, and it shows users that the system can fail.
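A planned failure can be rehearsed in code with a small fault-injection wrapper. The sketch below is only an illustration: `inject_failures`, `SyntheticOutage`, and `lookup_config` are hypothetical names, and a real planned outage would normally be driven by your deployment tooling rather than by a wrapper like this.

```python
import random


class SyntheticOutage(Exception):
    """Raised on purpose to simulate the dependency being down."""


def inject_failures(func, failure_rate, rng=None):
    """Wrap func so that roughly `failure_rate` of calls fail on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # random() returns a float in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise SyntheticOutage(f"planned outage of {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def lookup_config(key):
    # Hypothetical stand-in for a call to a very reliable dependency.
    return {"region": "eu-west"}.get(key)


# failure_rate=1.0 simulates a full planned outage of the dependency.
always_down = inject_failures(lookup_config, failure_rate=1.0)
```

Callers that silently assumed `lookup_config` never fails will surface quickly once `always_down` is swapped in during a test window.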
Retrying can solve the issue to some extent: set a sensible timeout, and a retry will often do the job. But multiple retries also put more load on the failing system, causing more failures. A short-circuiting mechanism (a circuit breaker) that stops calling the service after repeated failures avoids that problem too.
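Retries and short-circuiting can be combined roughly as follows. This is a minimal sketch, not a production implementation: `CircuitBreaker`, `CircuitOpenError`, and `retry` are hypothetical names, the thresholds are arbitrary, and the wrapped function is assumed to enforce its own timeout (for example through the timeout option of whatever client it uses).

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the failing service."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, skipping call")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering a service the breaker gave up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap the two together, for example `retry(lambda: breaker.call(fetch))`, so that retries stop as soon as the breaker opens.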
Calling an external service also fails sometimes, and the best way to avoid that is "don't call it at all". That is, unless the call is really required, use a workaround instead of calling the actual service. For example, you can pre-fetch static data.
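Pre-fetching static data could look something like this. It is a sketch under assumptions: `StaticDataCache` is a hypothetical class and `fetch_all` stands for whatever call returns the static data from the external service as key/value pairs.

```python
class StaticDataCache:
    """Holds data fetched ahead of time so requests never call the service."""

    def __init__(self, fetch_all):
        self._fetch_all = fetch_all  # hypothetical call to the external service
        self._data = {}

    def refresh(self):
        """Run at startup or on a schedule, outside the request path."""
        try:
            self._data = dict(self._fetch_all())
        except Exception:
            # The refresh failed: keep serving the previous snapshot.
            pass

    def get(self, key, default=None):
        # Served from memory; the external service is not touched here.
        return self._data.get(key, default)
```

The request path only ever calls `get`, so an outage of the external service costs freshness, not availability.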
In some scenarios we should just accept the failure and go on with old-school troubleshooting methods. Locking concurrent transactions is one way to avoid failure; with this method you accept extra latency for users in exchange for avoiding race conditions.
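As a small illustration of locking, here is a sketch with an in-process lock. `Account` and `run_demo` are hypothetical names, and a real system would more likely rely on database transactions or a distributed lock than on `threading.Lock`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Account:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # Only one thread at a time may run the check-then-update sequence,
        # so two withdrawals can never both pass the balance check.
        with self._lock:
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


def run_demo():
    account = Account(balance=100)
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda _: account.withdraw(30), range(8)))
    # Returns the final balance and the number of successful withdrawals.
    return account.balance, results.count(True)
```

Threads queue up on the lock, which is where the extra latency comes from, but the balance can never go negative.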
Falling back to previously calculated, sub-optimal estimates is also a way to bet on. You can serve these while the other services come back up. This method may not be optimal, but it is at least better than a "request failure".
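One way to sketch this fallback is to remember the last good value and serve it when the live computation fails. `EstimateWithFallback` and its `compute` callable are hypothetical names used only for illustration.

```python
class EstimateWithFallback:
    def __init__(self, compute, default):
        self._compute = compute    # hypothetical call to the live estimator
        self._last_good = default  # served until the first success

    def get(self):
        """Return (value, is_fresh); is_fresh is False for a fallback."""
        try:
            value = self._compute()
        except Exception:
            # Live computation failed: serve the last known estimate.
            return self._last_good, False
        self._last_good = value
        return value, True
```

Returning the `is_fresh` flag lets the caller decide whether to tell the user that the number may be out of date.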
A delayed response can buy some time in the meantime. Some businesses cannot tolerate a delayed response (stock markets, banking, etc.), but where it is acceptable you can definitely go for it. It is only possible in async services, where the request and the response can be separated.
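Separating the request from the response might be sketched like this: the request is accepted immediately with a ticket, and the result is picked up later. `JobQueue` is a hypothetical in-memory stand-in; a real service would probably use a message broker and persistent storage.

```python
import uuid
from collections import deque


class JobQueue:
    def __init__(self, handler):
        self._handler = handler  # hypothetical function doing the real work
        self._pending = deque()
        self._results = {}

    def submit(self, payload):
        """Accept the request immediately and hand back a ticket."""
        job_id = str(uuid.uuid4())
        self._results[job_id] = {"status": "pending", "result": None}
        self._pending.append((job_id, payload))
        return job_id

    def process_next(self):
        """Run by a background worker whenever the backend is available."""
        if not self._pending:
            return False
        job_id, payload = self._pending.popleft()
        try:
            result = self._handler(payload)
            self._results[job_id] = {"status": "done", "result": result}
        except Exception as exc:
            self._results[job_id] = {"status": "failed", "result": str(exc)}
        return True

    def status(self, job_id):
        """Polled by the client using the ticket from submit()."""
        return self._results[job_id]
```

The client gets its ticket right away even if the backend is struggling, and polls `status` until the job is `done` or `failed`.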
Default fallback logic should be in place: pre-planned, properly organized, and documented for the more complex issues. It is always good to fall back to something that is already there.
Thanks for reading up to here😊

