2015/08/05

A brief history of the Restore Point Simulator

During the development of the Restore Point Simulator, I have often encountered questions from users that led me to believe that it is not always clear how to use the tool and what it can do for you. In this blog article series, I want to take the time to explain why RPS was developed in the first place and how you can use it.

In the beginning there was nothing, just our famous formula to calculate repository space. I'll quote it here because it is still the main idea behind RPS. Many Veeam SEs had their own Excel configuration sheet to quickly spit out some numbers, some prettier than others.

Backup size = C * (F*Data + R*D*Data)
Data = sum of processed VMs size by the specific job (actually used, not provisioned)
C = average compression/dedupe ratio (depends on too many factors, compression and dedupe can be very high, but we use 50% - worst case)
F = number of full backups in retention policy (1, unless backup mode with periodic fulls is used)
R = number of rollbacks (or increments) according to retention policy (14 by default)
D = average amount of VM disk changes between cycles in percent (we use 10% right now, but will change it to 5% in v5 based on feedback... reportedly for most VMs it is just 1-2%, but active Exchange and SQL can be up to 10-20% due to transaction logs activity - so 5% seems to be good average)
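
To make the formula concrete, here is a minimal Python sketch with assumed example numbers (2 TB of used data, the 50% compression and a 5% change rate from the quote above, one full plus 14 increments); it is just the arithmetic, nothing more:

# Back-of-the-envelope use of the formula above (assumed example numbers).
data = 2000     # GB actually used by the VMs in the job
c = 0.5         # average compression/dedupe ratio (50%, worst case)
f = 1           # number of full backups in the retention policy
r = 14          # number of increments (the default 14 from the quote)
d = 0.05        # average daily change rate (5%)

backup_size = c * (f * data + r * d * data)
print(f"Estimated repository space: {backup_size:.0f} GB")   # 0.5 * (2000 + 1400) = 1700 GB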

This formula has some difficulties. First of all, the (C)ompression ratio and the (D)elta are difficult parameters to estimate, although the quote does give you some hints about what we at Veeam use internally and a fairly good explanation of why these values were chosen. More difficult are F and R. These values define how many full backups and how many incrementals you will need. With reverse incremental / forever incremental, that is quite easy to calculate: you'll have F = 1 and R = rps - F.

However, when you talk about weekly synthetic or active fulls, the number is rather difficult to calculate. Even Veeam users do not always understand the effect of a certain policy. For example, if you configure forward incremental with a weekly full and 2 restore points (rps), you can end up with up to 9 rps on disk because of dependencies. I had countless discussions with customers arguing that Veeam did (does) not respect their rps policy, when in fact it does its absolute best to respect your policy. If you run the simulation, you can actually see the dependency. In the first column (called Retention), you will see something like 3 (2) or 4 (2). This means that points 3 and 4 are kept because point 2 depends on them.
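
To make that dependency concrete, here is a toy Python model of forward-incremental chain retention (my own simplified illustration, not Veeam's or RPS's actual logic): one backup per day, a full every full_period days, and the oldest chain is only deleted once the newer chains on their own contain at least the configured number of restore points.

def simulate_chains(days, retention, full_period=7):
    chains = []        # each entry = number of restore points in one chain
    peak = []          # chain layout at the worst moment seen so far
    for day in range(1, days + 1):
        if not chains or day % full_period == 1:
            chains.append(1)       # a full starts a new chain
        else:
            chains[-1] += 1        # a daily increment extends the last chain
        if sum(chains) > sum(peak):
            peak = list(chains)    # record the worst case (most points on disk)
        # the oldest chain can go once newer chains satisfy retention by themselves
        while len(chains) > 1 and sum(chains[1:]) >= retention:
            chains.pop(0)
    return peak

# Forward incremental, weekly full, 2 configured restore points:
worst = simulate_chains(days=60, retention=2, full_period=7)
print(worst, "->", sum(worst), "points on disk")   # [7, 2] -> 9 points

In this toy model the worst case is a complete old chain of 7 points plus a new full and one increment, which is where the "up to 9" comes from.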

Now if you want to "excellify" this, you can come up with something like F = #Weeks + 1 and R = F*7*#DailyBackups - F. Imagine 14 rps with daily backups: that would be F = 2+1 = 3 and R = 3*7*1 - 3 = 21 - 3 = 18. That comes really close to what RPS says, but explaining it to people takes some time, and it is not always accurate; it is more of a guesstimate.
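
Reusing the toy simulate_chains model from the sketch above, the same 14-point daily policy with weekly fulls peaks at exactly the 21 points the guesstimate predicts:

# 14 restore points, daily backups, weekly fulls (reusing simulate_chains above).
worst = simulate_chains(days=120, retention=14, full_period=7)
print(worst, "->", sum(worst), "points")   # [7, 7, 7] -> 21, matching F + R = 3 + 18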

Another common misconception is that a monthly full backup would require less space than a weekly full backup. While this can be the case, remember that a monthly full creates a chain of roughly 30 points. If you configure a policy of 14 points in forward incremental with a monthly full, the worst case scenario occurs about 12 days after the second full backup is created. This is because you have 12 increments dependent on the current full, but you still need to keep the whole previous chain, because the oldest restore point you must keep is an increment that depends on the previous full backup and its chain of roughly 30 increments. If you configure a weekly full, a chain is at most 7 days long, so less is stored. This effect grows even faster when you back up, for example, every 12 hours or more often. However, if you configure, for example, 60 restore points, a monthly full backup can be cheaper than a weekly full backup. The more days' worth of restore points you configure, the more likely it is that a monthly full backup will actually consume less space.
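
To illustrate that trade-off, here is a rough comparison built on the same toy model and the same assumed numbers as before (2000 GB of source data, 50% compression, 5% daily change); the sizes are illustrative only and not RPS output:

# Rough space comparison on top of simulate_chains() (illustrative numbers only).
data, c, d = 2000, 0.5, 0.05
full_size, inc_size = c * data, c * d * data     # ~1000 GB per full, ~50 GB per increment

for retention in (14, 60):
    for label, period in (("weekly", 7), ("monthly", 30)):
        worst = simulate_chains(days=400, retention=retention, full_period=period)
        fulls, incs = len(worst), sum(worst) - len(worst)
        size = fulls * full_size + incs * inc_size
        print(f"{retention:>2} points, {label:>7} full: "
              f"{fulls} fulls + {incs} increments ~ {size:.0f} GB")

In this model the weekly full comes out cheaper at 14 points, while the monthly full comes out cheaper at 60 points, which is exactly the effect described above.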

These two examples show exactly why RPS was made. Different customer cases require different approaches. They also reconfirm that assumption is the mother of all mistakes. Explaining how retention works without very difficult formulas was actually my main goal when the first edition of RPS was made.

Another example is my new all-time favorite; it shows why what feels natural is not always reality. Some months ago, a partner assumed a forever incremental backup chain of 365 points would be more efficient than a GFS policy with 12 fulls. This even surprised me the first time I ran it, because incremental backups feel more lightweight. I remembered from my v7 SE training that GFS should be more efficient, and running the simulation reconfirms this.

It is true that forever incremental is much more efficient than weekly fulls in terms of disk space. However, 30 increments quickly add up, and for long-term retention a monthly full can be more efficient. There is one caveat: with 365 increments you do have more granularity than with 12 monthly full points in time. However, I do want to remind you that those 12 full backups are completely independent of each other. A single bit-rot corruption would only impact one point, while in a 365-restore-point chain it potentially impacts the whole chain. So I think that in the majority of cases the more efficient disk usage and the independence of points beat a very long chain of increments, but hey, it is up to each company to decide its own policy.
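
As a back-of-the-envelope illustration of that comparison, with the same assumed numbers as before and ignoring the short simple-retention chain a GFS job also keeps:

# 365-point forever incremental vs 12 independent monthly GFS fulls
# (illustrative numbers: 2000 GB source, 50% compression, 5% daily change).
full_size, inc_size = 1000, 50                   # GB, from c*data and c*d*data above

forever_365 = full_size + 364 * inc_size         # one full plus 364 increments
gfs_12_fulls = 12 * full_size                    # twelve independent monthly fulls
print(f"365-point forever incremental: ~{forever_365} GB")   # ~19200 GB
print(f"12 GFS monthly fulls:          ~{gfs_12_fulls} GB")  # ~12000 GB

Of course, with a very low daily change rate the balance can tip the other way, which is exactly why running the real simulation beats gut feeling.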

Finally, I remember that one of the major updates was adding GFS support. Calculating and explaining GFS policies is nearly impossible with Excel. Why? Imagine you configure weekly backups to keep the Sunday restore points and monthly backups to keep the restore points of the first Sunday of the month. In this case, the backup of the first Sunday of the month can be used to satisfy the weekly backup policy as well as the monthly one. In fact, this is what Veeam does. So if you configure, for example, 12 weeklies and 3 monthlies, you would assume that the number of fulls is 12+3+1 (1 for the simple retention policy). However, this is not the case. If you configure your policy correctly so that weekly and monthly points can coincide (schedule button), you will actually get fewer points. You can see these shared points again in the Retention column: "10W 3M 0Q 0Y" means that the point represents the 10th weekly point but also the 3rd monthly point.
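
Here is a toy illustration of that coalescing (a simplified sketch of the idea, not RPS code): weekly points land on Sundays, monthly points on the first Sunday of the month, and a point that matches both carries both flags.

# Toy illustration of coinciding GFS points (simplified, not RPS's actual code).
from datetime import date, timedelta

start = date(2015, 1, 1)
fulls = {}                                   # date -> set of GFS flags
for i in range(120):                         # about four months of daily backups
    day = start + timedelta(days=i)
    if day.weekday() == 6:                   # Sunday -> weekly point
        fulls.setdefault(day, set()).add("W")
        if day.day <= 7:                     # first Sunday of the month -> monthly point
            fulls[day].add("M")

for day, flags in sorted(fulls.items()):
    print(day, "+".join(sorted(flags)))
# Points printed as "M+W" satisfy both policies at once, so 12 weeklies and
# 3 monthlies need fewer than 15 separate full backups.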

@poulpreben (if you don't follow him on Twitter, do it now) and I spent hours discussing how we could calculate this with formulas. We concluded that the only way to actually do it was to emulate what happens inside B&R on a daily basis for some period of time. In fact, that is what RPS does. If you configure a retention policy, it will try to predict a period of time in which the worst case scenario (the most data on disk) should occur. This is why, when you configure 5 yearly backups, it takes some time to calculate: RPS will run over 2000 simulated days trying to mimic the behaviour of B&R.

So, TL;DR? Don't just assume; run it through RPS. Be critical of the results RPS gives you (software can contain bugs), but also try to understand why something is different from what you first estimated.