Overall Health Metric of a System

Prometheus is one of the most widely used tools for monitoring system metrics. It comes with almost 130 different metrics, but we don't need all of them every time we run a monitoring task. Some of them describe its TSDB, yet even those don't seem to reveal enough about the TSDB's actual state. So the point is this: anything that exposes metrics should also expose an overall health metric.
This solves two problems for SREs.
First, there is no need to search through every metric for a piece of information we expect to be there, and no need to run repeated grep searches for keywords such as error, fail, or corrupt. Instead of hunting for the 8 TSDB metrics among 130, a single overall health metric for the TSDB should be enough.
Second, whenever an update lands in a system, its metrics change with it. That is a problem for the people who track metrics day to day, because they have to add the new metrics to their dashboards and rearrange them. If something goes wrong or breaks along the way, that becomes one more problem in the chain.
So how should this work? The overall metric should cover the basics of most of the underlying metrics and report an average health for the system. That should be enough for whoever is watching Prometheus to decide which area to dig into further. The same overall metric should be applied both to each sub-system and to the global system, to avoid unnecessary confusion in the team.
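One way to picture this averaging is the rough sketch below. It assumes every underlying check has already been normalized to a score between 0.0 (broken) and 1.0 (healthy). The class name `SubSystem`, the helper names, and the check names used later are all made up for illustration; nothing here is a Prometheus API.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class SubSystem:
    """A hypothetical sub-system with a set of normalized check scores."""
    name: str
    # Check name -> score in [0.0, 1.0], where 1.0 means fully healthy.
    checks: Dict[str, float] = field(default_factory=dict)

    def health(self) -> float:
        """Plain average of all check scores; no checks is treated as 0.0."""
        if not self.checks:
            return 0.0
        return mean(self.checks.values())


def global_health(subsystems: List[SubSystem]) -> float:
    """Average of the sub-system health values (assumed equal weights)."""
    if not subsystems:
        return 0.0
    return mean(s.health() for s in subsystems)


def weakest(subsystems: List[SubSystem]) -> str:
    """Name of the sub-system with the lowest health: where to dig first."""
    return min(subsystems, key=lambda s: s.health()).name
```

With a `tsdb` sub-system scoring 1.0 and 0.5 on two checks and an `api` sub-system scoring 1.0 on one, the sub-system values come out as 0.75 and 1.0, the global value as 0.875, and `weakest` points at `tsdb`. A plain average is the simplest choice; weighting critical checks more heavily would be a reasonable variation.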
Suppose you have a fairly large system with log capturing and alerting at both the sub-system and global level. You can then add a webhook to the logging system that records the time of the last message received from each sub-system, together with its priority level. That value can then be exposed as a Prometheus metric for future use.
Thanks for reading this far.
source - HaveGeneralHealthMetric

