What do you do to prevent software downtime? Assessment? Tools? Metrics?

I hope I'm not the only person interested in this question because their IT infrastructure is prone to failure (like any software in the world :)).

For the past few months, I have been looking for information about software downtime prevention: tips and tricks, best practices, and recommendations. I came across suggestions such as preventive maintenance and personnel training.

I've also heard that some vendors offer software that predicts downtime. Such tools give IT experts technical metrics that show the likelihood of downtime, and help them track anomalies and reduce unplanned failures. For example, you might be aware of solutions that diagnose IT infrastructure, such as InsightCat, Datadog, and Dynatrace.

How do you assess your system's health and predict downtime? Do you use a downtime prevention tool for this? Which critical metrics indicate that something is wrong with the system?

Thank you in advance.

