If you can’t measure it, you can’t manage it. Some may argue that there are exceptions to this truism, but backup and recovery is not one of them. Although there is a growing effort to measure backup success rates, this metric alone is not sufficient to signify a healthy backup environment. Success rate is one important risk indicator, but additional risk metrics, along with efficiency and service-level metrics, are necessary to build a true picture of backup health. Here are a few more metrics to consider:
1. Partial backup completion. Partial backups should be regarded as failures until their cause is understood, but in many environments they are counted as successes or simply not reported. The rationale is that some backup jobs include temporary files that are held open by applications and therefore cannot be backed up successfully. Without a detailed investigation, however, it is impossible to know whether this is really the case. If the temporary files are truly benign, they should be added to an “exclude” list to avoid nuisance messages. Logs and reports filled with messages about partial backups make it harder to identify real problems.
2. Consecutive backup failures. It’s problematic when a system backup fails in a single backup cycle, but it can be disastrous if subsequent backups of the same system fail repeatedly. Each additional failure pushes the effective recovery point further beyond the Recovery Point Objective committed to in service levels, and this occurs more frequently than one might expect. A good reporting system should flag consecutive backup failures.
3. Media utilisation. This is an important efficiency metric that is often overlooked. Tape media is expensive and in many environments utilisation is significantly below 70%. In one multipetabyte backup environment, my firm, GlassHouse Technologies, found that a 10% improvement in media utilisation would translate into nearly US$400,000 (NZ$582,000) of annual tape savings.
4. Tape-drive performance. Poor tape-drive performance not only results in low drive utilisation but also increases the risk of backup failure and reduces the usable life of both media and drive mechanisms. Modern tape devices are capable of 40Mbit/s throughput and higher, yet actual performance rates of less than 10Mbit/s are common. Many environments are simply unaware of their dismal performance and exacerbate the problem by making design and purchasing decisions based on unrealistic expectations.
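As a rough illustration of the two risk metrics in the list above, the following Python sketch counts partial backups as failures and flags systems whose most recent backups have failed back to back. The record layout, the status labels ("success", "partial", "failed") and the threshold of two are assumptions made for the example; they do not come from any particular backup product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JobResult:
    client: str   # hypothetical name of the system being backed up
    day: date     # date of the backup cycle
    status: str   # assumed labels: "success", "partial" or "failed"


def is_failure(status, partial_is_failure=True):
    """Treat partial backups as failures unless told otherwise."""
    if status == "failed":
        return True
    return partial_is_failure and status == "partial"


def success_rate(results, partial_is_failure=True):
    """Fraction of jobs that count as successful."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if not is_failure(r.status, partial_is_failure))
    return ok / len(results)


def consecutive_failures(results, threshold=2):
    """Map each client to its current run of trailing failures,
    keeping only clients whose run is at least `threshold` long."""
    by_client = {}
    for r in sorted(results, key=lambda r: (r.client, r.day)):
        by_client.setdefault(r.client, []).append(r)

    flagged = {}
    for client, history in by_client.items():
        streak = 0
        for r in reversed(history):  # walk back from the newest job
            if not is_failure(r.status):
                break
            streak += 1
        if streak >= threshold:
            flagged[client] = streak
    return flagged


# Invented sample data, for illustration only.
results = [
    JobResult("db01", date(2024, 1, 1), "success"),
    JobResult("db01", date(2024, 1, 2), "partial"),
    JobResult("db01", date(2024, 1, 3), "failed"),
    JobResult("web01", date(2024, 1, 1), "failed"),
    JobResult("web01", date(2024, 1, 2), "success"),
    JobResult("web01", date(2024, 1, 3), "success"),
]

print(success_rate(results))                            # partials counted as failures
print(success_rate(results, partial_is_failure=False))  # partials counted as successes
print(consecutive_failures(results))                    # clients with 2+ trailing failures
```

On this sample data the reported success rate drops from about 67% to 50% once partials are counted as failures, and db01 is flagged even though it never recorded two outright "failed" jobs in a row.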
Unfortunately, these metrics are not easily obtained through the reporting capabilities of traditional backup applications. However, spending on tools or services that produce this data might be a very good investment.
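The arithmetic behind the two efficiency metrics is simple once the raw numbers are available. The sketch below is a minimal illustration, assuming the data volume, cartridge capacity, bytes written and elapsed time have already been collected by some means; all function names and sample figures are hypothetical and are not taken from the environment described above.

```python
import math


def media_utilisation(used_gb, capacity_gb):
    """Fraction of a cartridge's capacity that actually holds backup data."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return used_gb / capacity_gb


def tapes_needed(total_gb, capacity_gb, utilisation):
    """Cartridges required to hold total_gb at a given average utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return math.ceil(total_gb / (capacity_gb * utilisation))


def throughput_mbit_s(bytes_written, seconds):
    """Average drive throughput in megabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_written * 8 / 1_000_000 / seconds


# Hypothetical figures: 1 PB of backup data on 400 GB cartridges.
total_gb = 1_000_000
capacity_gb = 400

at_60 = tapes_needed(total_gb, capacity_gb, 0.60)
at_70 = tapes_needed(total_gb, capacity_gb, 0.70)
print(f"Cartridges at 60% utilisation: {at_60}")
print(f"Cartridges at 70% utilisation: {at_70}")
print(f"Cartridges saved: {at_60 - at_70}")

# Hypothetical job: 4.5 GB written in one hour.
print(f"Throughput: {throughput_mbit_s(4_500_000_000, 3600)} Mbit/s")
```

Comparing the measured throughput with the drive's rated figure, and the cartridge count at current utilisation with the count at a target utilisation, gives a first estimate of how much performance and media are being left unused.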