2015/04/09

Veeam Data Domain integration X-Rayed

With the latest release of v8, Veeam introduced Data Domain integration. The integration is based on DD Boost. But what does that actually mean? Well, it means backup copy jobs (and backups) towards a DD will run faster.

So what's so good about the integration? First of all, Veeam supports "Distributed Segment Processing". The basic idea is that Veeam does the Data Domain deduplication at the Veeam side. The main advantage is that DD deduplication is global dedup. If you have, for example, 2 backup copy jobs, each copying 1 Windows VM without DD Boost, Veeam will send over the OS blocks 2 times, simply because they are different jobs.

With DD Boost, when the gateway server (the component that talks DD Boost) has to store a second copy of the OS blocks, it sends pointers down the line because it knows that the Data Domain has already stored those duplicate blocks. The main advantage is that there will be less network usage and the DD doesn't have to do any processing anymore. Hence the term Distributed (=each client) Segment (chunks of data) Processing. Job performance might also improve significantly, because the second time those same blocks need to be stored, there is no real write occurring on the Data Domain, just a metadata update.
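To make that more tangible, here is a toy PowerShell model I made up. It is obviously not the real DD Boost protocol, just the idea of "hash the block, and if the target already knows the hash, only send a pointer":
# Toy model of client-side dedup, NOT the real DD Boost protocol:
# if the target already knows a block's hash, only a pointer is "sent".
$storedHashes = @{}
$md5 = [System.Security.Cryptography.MD5]::Create()

function Send-Block([byte[]]$block) {
    $hash = [BitConverter]::ToString($md5.ComputeHash($block))
    if ($storedHashes.ContainsKey($hash)) {
        Write-Host "Duplicate block -> sending pointer only"
    } else {
        $storedHashes[$hash] = $true
        Write-Host ("New block -> sending {0} bytes" -f $block.Length)
    }
}

# Two "jobs" storing the same OS block: only the first transfer moves data.
$osBlock = [Text.Encoding]::ASCII.GetBytes("windows-os-block")
Send-Block $osBlock   # job 1: full data goes over the wire
Send-Block $osBlock   # job 2: only metadata is updated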

Although most people focus on this part of the DD Boost integration, it's not my favorite part. So in this article I want to focus on "Virtual Synthetics" and what it means. To understand the benefits, let's first look at how a backup copy job works.

First off, the backup copy job or bcj doesn't copy files. It copies data and constructs new backup files at the target side. So whether you use forward incremental, reverse incremental or forever incremental, the result of the bcj will always be the same. The bcj uses a similar strategy as forever incremental. So let's take a really trivial example. Imagine you created a bcj with a retention of 2 points.

The first day you will create a full backup file.

The second day, you will create an increment file and store only blocks that have changed. No rocket science so far.
The third day, you will create another increment. However, 3 points are more than the configured policy of 2, so some action is required.
Just deleting files is not an option: you cannot delete the oldest backup file, because it is the full backup. This is why the backup copy job does something called a merge.
The idea is that you take the blocks from the oldest increment, read them from disk and then write them back into the original full backup file, essentially updating the changed blocks in the full backup file.
The result is that the full backup file now represents the restore point from day 2, and the number of backup files again equals the retention policy you configured.
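To make the merge concrete, here is a toy sketch (my own illustration, not Veeam's actual file format) that models backup files as maps of block id to block content:
# Toy sketch of the bcj merge, not Veeam's actual file format.
$full = @{ a = 'a'; b = 'b'; c = 'c' }   # day 1: full backup
$inc1 = @{ a = 'a2' }                    # day 2: block a changed
$inc2 = @{ b = 'b2' }                    # day 3: block b changed

# 3 points on disk versus a policy of 2: fold the oldest increment
# into the full file, one random read + write per changed block.
foreach ($blockId in @($inc1.Keys)) {
    $full[$blockId] = $inc1[$blockId]
}
# $full now represents the day 2 restore point and $inc1 can be deleted.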

But why is that bad for Data Domain? Well, Data Domain was designed for sequential writes. In fact, it was created to replace tape, which is why most dedup devices have VTL functionality.

With tape in mind, remember those old VHS cassettes? If you recorded one soap (different episodes) on one tape in chronological order and one day wanted to binge-watch the soap, you could just put in the cassette and push play. No delay when switching episodes, because it is just streaming the tape.

However, imagine you are going out and different members of the family recorded different series and movies on one cassette. When you wanted to play your specific part, you needed to skip back and forward to get to it. This took some time, because the tape needs to wind to that specific point.

Binge-watching or just playing the video is, in data terms, what we call sequential I/O. You just read the data in the chronological order that you have written it. Skipping back and forward to read (or write to) that specific part you need is what is called random I/O. And as with tapes, it is pretty slow. Now if you design your device to act like tape, you can write really fast, but the random I/O kills you.

Well, this is why bcj merging is actually pretty slow on deduplication devices in general. It is a lot of random reading from and writing to files. So how does DD Boost help here? First of all, you should understand that the DD has metadata where it stores pointers to the data blocks that make up a specific file. But let's not go too deep into how a filesystem works; you've got Wikipedia for that.

When DD Boost is being utilized, Veeam will not read blocks a', d', h' and l' and then write them back to the DD as it usually does. Instead, it instructs the DD to point to the blocks already on disk. Because you are just writing pointers, this "merge process" is fast compared to the regular standard merge process.
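Going back to the toy model from earlier, the DD Boost version of the merge would look like this: the full file is just a map of pointers, and the merge is nothing more than repointing entries:
# Same toy model, but now the full file is only a block map of pointers.
$fullMap = @{ a = 'ptr-a'; b = 'ptr-b'; c = 'ptr-c' }
$incMap  = @{ a = 'ptr-a2' }             # the oldest increment to fold in

# The merge never moves data: it is a metadata-only pointer update.
foreach ($blockId in @($incMap.Keys)) {
    $fullMap[$blockId] = $incMap[$blockId]
}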

So this makes the Data Domain an excellent backup copy job target, especially because you can define GFS on the bcj. So imagine you instruct the bcj to keep 6 weekly backups; you will actually have 7 full backups on the DD, one for every week and one active VBK.

One thing that does not change is the restore time. Veeam restores are pretty random, especially if you need to read from different files. Imagine you have a chain of 14 points and you want to restore something from the newest point; with the backup copy job, that would mean accessing 14 different files in the background.

If you ask me, the DD is an excellent bcj target. Just keep a small amount of restore points on JBODs for example (say 7 to 14 points), which are excellent at handling random I/O, and then tier them to a DD. If you choose your policy correctly on the main target, 95% of the restores will come from the first tier. However, if something bad happens and you need to go back a couple of months, it is acceptable that the restore might take a bit longer in favor of the huge amount of restore points you need to keep. For Veeam, this is actually the preferred architecture.

2015/01/21

Microsoft iSCSI Target: mass-adding initiators

If you ever need to add a lot of IP addresses to a Microsoft iSCSI target, here is a sample script:
$tgt = Get-IscsiServerTarget -TargetName "vmware01"
# Start from the initiator IDs already assigned to the target
$inits = $tgt.InitiatorIds
# Build an InitiatorId object for every address in the range
10..20 | ForEach-Object {
    $ip = "192.168.253.$_"
    $inits += New-Object Microsoft.Iscsi.Target.Commands.InitiatorId("IPAddress:$ip")
}
# Write the combined list back to the target (pass the array itself, no braces)
$tgt | Set-IscsiServerTarget -InitiatorIds $inits
This will add the IP range 192.168.253.10-192.168.253.20. I surely would love wildcards :)
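In the meantime you could wrap the same cmdlets in a small helper function of your own (Add-TargetInitiatorRange is just a name I made up):
# Hypothetical wrapper around the same cmdlets: add a whole range in one call.
function Add-TargetInitiatorRange {
    param($TargetName, $Prefix, $Start, $End)
    $tgt = Get-IscsiServerTarget -TargetName $TargetName
    $inits = $tgt.InitiatorIds
    $Start..$End | ForEach-Object {
        $inits += New-Object Microsoft.Iscsi.Target.Commands.InitiatorId("IPAddress:$Prefix.$_")
    }
    $tgt | Set-IscsiServerTarget -InitiatorIds $inits
}
Add-TargetInitiatorRange -TargetName "vmware01" -Prefix "192.168.253" -Start 10 -End 20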

2015/01/12

Veeam v8 Forever Incremental

With the release of v8, a new backup method has been introduced, called forever incremental. Not a lot of fuss around it, but still a nice feature:
  • First of all, the method creates an increment in a similar way as the traditional forward incremental. The big advantage over reverse incremental is that creating a VIB file is fairly sequential. Thus the snapshot on the processed VM will be removed earlier than with reverse incremental. What is important is that the overall job time might be higher, because the merge process might take longer, but the impact on production is lower.
  • The forever incremental job uses the same mechanism as the backup copy job to respect the backup retention policy. First it creates the VIB file. Once the number of restore points exceeds the policy, it injects the oldest VIB file into the VBK file. Again, this process is fairly random, but it should be 33% less I/O than reverse incremental backup during the merge, and it is performed after the snapshot is deleted on the VM.
  • Because there is only 1 full VBK file, it uses less disk storage and only stores incremental data.
What is important is that there is still random I/O: if one job is merging and another one is still in the backup process, the latter might be impacted if you are backing up to the same spindles. Still, the I/O penalty is lower (2 I/Os for the merge vs 3 for reverse incremental).

So how do you configure it? Well, you just select incremental mode and disable any synthetic or active full backups, like so. If you did this in v7, the GUI would complain that you are not doing any full backups; in v8 it will switch to forever incremental.


Configuring is quite easy. Now, a lot of customers have asked: how can we do active fulls? Well, if you configure active full backups, you are basically disabling forever incremental, and thus the job won't do any merging. It is also why it is called forever incremental.

There are some good reasons why you would want to do an active full every month or every 2 months. First of all, corruption. However, with Veeam, it is highly recommended to use SureBackup to execute recovery tests. In v7, a new option has been added to verify all the blocks (or the complete backup file). You can find this option in the settings tab of the SureBackup job. When in doubt, check the manual.

If you do not run SureBackup: in v7 a manual tool was introduced, "Veeam.Backup.Validator.exe". You can read more about it on Luca's blog.

In v8 this tool has been extended so you can run it on backups that are not even imported in B&R. You can also output the report to an XML file, which should allow you to script around it. For example:
Veeam.Backup.Validator.exe /file:'V:\Backup\Backup Job Linux\Backup Job Linux.vbm' /format:xml /report:V:\linux.xml
Then with PowerShell it is really easy to read out the values. The parameters might be a bit more difficult, but here is an example:
param(
    $validator = "C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.Validator.exe",
    $resultfile = [System.IO.Path]::GetTempFileName(),
    $backupfile = "V:\Backup\Backup Job Linux\Backup Job Linux.vbm"
)

# Run the validator silently and write an XML report to a temp file
& $validator ("/file:{0}" -f $backupfile) ("/report:{0}" -f $resultfile) "/format:xml" "/silence"

# Parse the XML report and read out the interesting values
$result = [xml](Get-Content $resultfile)
Write-Host ("Result {0}" -f $result.Report.ResultInfo.Result)
Write-Host ("Checked {0}" -f $result.SelectSingleNode("//Parameter[@Name='Backup files count:']").InnerText)
Watch out: I tried to run the example via the PowerShell ISE, and the validator didn't spit out correct results; running it manually seems to work. Also, the validator seems to only check the last restore point. Instead of specifying the VBM file, you can also specify a VBK file so the check will be done on that specific file.
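Building on that, a possible extension (an untested sketch, using the same validator switches as above) is to loop over all VBK files in the job folder and validate each one, since the VBM check only covers the last point:
# Sketch: validate every VBK in the job folder, one report per file.
$validator = "C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.Validator.exe"
Get-ChildItem "V:\Backup\Backup Job Linux" -Filter *.vbk | ForEach-Object {
    $report = [System.IO.Path]::GetTempFileName()
    & $validator ("/file:{0}" -f $_.FullName) ("/report:{0}" -f $report) "/format:xml" "/silence"
    $xml = [xml](Get-Content $report)
    Write-Host ("{0} -> {1}" -f $_.Name, $xml.Report.ResultInfo.Result)
    Remove-Item $report
}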



OK, so validation (or "health check" as it is called for a backup copy job) is not an issue. Apparently fragmentation of the backup file has been enhanced greatly as well in v8.

However, there is one thing you cannot do in Windows, and that is shrink files. Imagine you back up 10 VMs today, but in a couple of weeks, after a migration, 2 VMs are deleted. You might have archived them, so they are no longer in production and thus are not being backed up anymore. Well, the unique blocks of those VMs are marked as deleted, but the VBK file will never become smaller. When Veeam needs to store or inject more data, it will try to reuse those "blocks" or "empty space", but the file never shrinks. For a backup copy job, a method called "compacting" was introduced: it recreates the VBK file and skips empty blocks. However, this method is not (yet?) available for a normal backup job. Thus the only way to accomplish this is to do an active full.

However, as discussed, if you enable active fulls in the scheduler, the backup job will switch back to the v7 forward incremental style and will not merge anything. The solution? Run an active full once in a while manually, or create a small PowerShell script. The script itself can be rather small, like:
# Load the Veeam snap-in and kick off an active full for the job
Add-PSSnapin VeeamPSSnapin
Get-VBRJob -Name "Backup Job Linux" | Start-VBRJob -FullBackup
You could schedule this, for example, every 2 months. If you need help scheduling, check out my previous blog post.
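For completeness, one way to do the scheduling (a sketch; the task name, time and script path are examples) is a scheduled task that fires every 2 months:
# Example only: run the active full script every 2 months at 22:00.
schtasks /Create /TN "Veeam active full" /SC MONTHLY /MO 2 /ST 22:00 `
    /RU SYSTEM /TR "powershell.exe -File C:\Scripts\ActiveFull.ps1"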

What is important is that, because you execute an active full, your potential retention length might be 2x your retention policy before the previous chain is deleted. What does this mean in plain English? Well, imagine you configured 3 restore points. After a while you execute an active full. You will have the following situation:

 

However, at this point nothing is deleted, because the active full starts a new backup chain. So nothing will be deleted until this new chain has 3 restore points.


2014/12/08

Instant prereq for SCOM 2012 R2 on Windows 2012 R2

If you really don't want to figure out all the individual roles and features, run:
import-module servermanager
add-windowsfeature Web-Server,NET-Framework-45-ASPNET
add-windowsfeature Web-Asp-Net,Web-Asp-Net45,Web-Metabase,Web-Windows-Auth,Web-Request-Monitor,Web-Mgmt-Console,NET-WCF-HTTP-Activation45
This will install all the necessary roles for SCOM 2012 R2. If you get errors about CGI/ASP handlers not being registered, restart the server.

2014/08/29

Veeam Explorer For Exchange without logs

So you made a backup of your Exchange server with Veeam and want to recover Exchange items. Well, that is quite easy with the Veeam Explorer for Exchange. But what if you have the log files on a different VMDK than the EDB file, and you excluded that disk? Will you be able to recover from the EDB alone?

That's a question that came up on our internal forums. At first, it looks like it is not possible. You will get this kind of message:


It says that you can't open the EDB because "Online Exchange backup detected, log replay is required".

So what can you do? Well, first start a Windows file-level recovery of your Exchange server.


This should mount the server disks under c:\veeamflr\exchange\ (depending on the VM name). Now start by extracting eseutil to a defined directory on your Veeam server. By default you can find the tools and DLLs under:
 c:\veeamflr\exchange\volume1\Program Files\Microsoft\Exchange Server\V15\Bin


Personally, I just copied everything that starts with ese, like so:
cp "c:\veeamflr\exchange\volume1\Program Files\Microsoft\Exchange Server\V15\Bin\ese*" .

Alternatively, you can also copy them from your live Exchange server.

Now let's query the DB by using eseutil with the /mh parameter, like so:
PS C:\eseextract> .\eseutil.exe /mh "C:\veeamflr\exchange\volume1\Program Files\Microsoft\Exchange Server\V15\Mailbox\Mailbox Database 1821327848\Mailbox Database 1821327848.edb"

 

It shows that the DB is in a dirty shutdown state, matching the message from the explorer. So let's hard repair it without the logs.

Now here is the tricky bit. When you start File Level Recovery, a cache file will be created that holds all writes, under:
C:\Windows\system32\config\systemprofile\AppData\Local\mount_cache{}


The cache will be deleted automatically, but it might mean that while you are repairing, it could grow and fill up your whole C: drive. If you are not sure, copy the EDB to a second location where you have plenty of space. Also, you will see that the recovery process might need up to 2x the space of the original EDB. This is because it will create a TMP file to work on. So plan for that as well.
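If you want to play it safe, a quick pre-flight check (my own sketch, using the paths from this example) can compare the free space on the drive against 2x the EDB size:
# Sketch: the repair needs roughly 2x the EDB size, so check free space
# on the drive that will hold the TMP file (C: here) before starting.
$edb = Get-Item "C:\veeamflr\exchange\volume1\Program Files\Microsoft\Exchange Server\V15\Mailbox\Mailbox Database 1821327848\Mailbox Database 1821327848.edb"
$free = (Get-PSDrive C).Free
if ($free -lt 2 * $edb.Length) {
    Write-Warning ("Only {0:N1} GB free, the repair may need ~{1:N1} GB" -f ($free / 1GB), (2 * $edb.Length / 1GB))
}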

In my scenario, I kept the file in the original location, but I specified that the TMP file should be on another drive. To recover, use eseutil.exe /p (optionally specifying the /t parameter for the TMP file):
PS C:\eseextract> .\eseutil.exe /p "C:\veeamflr\exchange\volume1\Program Files\Microsoft\Exchange Server\V15\Mailbox\Mailbox Database 1821327848\Mailbox Database 1821327848.edb" /t "E:\tmp\tmp.edb"


It will give you a warning that you might potentially lose data. However, remember we are reading the backup read-only and redirecting writes to the mount_cache file, so no harm is done.



After some time it should be recovered. You can then validate it again with the /mh parameter, like so:
PS C:\eseextract> .\eseutil.exe /mh "C:\veeamflr\exchange\volume1\Program Files\Microsoft\Exchange Server\V15\Mailbox\Mailbox Database 1821327848\Mailbox Database 1821327848.edb"


Your EDB should now be in a clean shutdown state. Now open up the Veeam Explorer for Exchange from the start menu. (If you can't find it, by default it's under "C:\Program Files\Veeam\Backup and Replication\ExchangeExplorer\Veeam.Exchange.Explorer.exe".)

Then push "add store" and point to your EDB which is under the original EDB path we used with eseutil. In my case:
C:\VeeamFLR\exchange\Volume1\Program Files\Microsoft\Exchange Server\V15\Mailbox\Mailbox Database 1821327848\Mailbox Database 1821327848.edb


For the log directory, point to the directory holding the EDB. You should now be able to click open and get it to work.


2014/08/21

Removing the SCOM 2012 R2 agent on a core edition

Recently I reinstalled the whole SCOM setup in my lab, just because I wanted to test with the latest versions like R2 and needed to use the 180-day trial license for that. This left me with a problem: some of my servers kept reporting to my old SCOM server although it was obviously down. Re-adding the servers to the current SCOM server didn't work. In the logs, I saw the following messages reappear ("scom" being my old server):
The description for Event ID 21006 from source OpsMgr Connector cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

scom

5723
10060
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

So I decided to remove it from SCOM and then manually uninstall the agent. One problem: one of the servers is a core edition. Good luck launching appwiz.cpl on that one. Luckily, it is quite easy to find out how you need to uninstall it. Launch regedit and go to:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\
For every installed program, there should be a subkey under which you can see the DisplayName and the UninstallString.

Alternatively, you can use the following script:
Get-ChildItem 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\' | ForEach-Object {
    Write-Host ("Soft : {0} `n`t {1}" -f $_.GetValue("DisplayName"), $_.GetValue("UninstallString"))
}
It should output something like this:
Soft :

Soft : Microsoft Visual C++ 2008 Redistributable - x64 9.0.30729.4148
         MsiExec.exe /X{4B6C7001-C7D6-3710-913E-5BC23FCE91E6}
Soft : VMware Tools
         MsiExec.exe /X{4D80C805-67C3-4525-A7BA-DC43215E9167}
Soft : Microsoft Monitoring Agent
         MsiExec.exe /I{786970C5-E6F6-4A41-B238-AE25D4B91EEA}
 So to uninstall the agent, first stop the service, just to be sure:
 net stop healthservice
 Then uninstall the agent. I used the /X flag (uninstall) instead of the /I flag shown in the UninstallString:
 MsiExec.exe /X{786970C5-E6F6-4A41-B238-AE25D4B91EEA}
By the way, at first the command didn't want to do anything; rebooting the server helped. A GUI should appear asking if you are sure you want to uninstall.
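If you have to do this on more than one box, a small sketch like this (my own, built only on the registry keys shown above) can look up the GUID by DisplayName and start the uninstall for you:
# Sketch: find the agent's product GUID by DisplayName and uninstall it.
$key = Get-ChildItem 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\' |
    Where-Object { $_.GetValue("DisplayName") -eq "Microsoft Monitoring Agent" }
if ($key) {
    $guid = $key.PSChildName          # the {GUID} subkey name
    Start-Process msiexec.exe -ArgumentList "/X$guid" -Wait
}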

Then I deleted the directories under Program Files, just to be sure that no residue was left on the filesystem:
rmdir "C:\Program Files\System Center Operations Manager" /s
rmdir "C:\Program Files\Microsoft Monitoring Agent" /s
Redeployed, and let's hope I won't see those nasty error messages reappear.

2014/05/21

What is the buzz around Backup from Storage Snapshots anyway?

It is always great when vendors announce new features, because most likely they are solving issues existing customers have. One of the better features Veeam released is Backup from Storage Snapshots. In v7, this feature supports the HP StoreServ and HP StoreVirtual storage platforms. As recently announced, in v8 this feature will be extended to NetApp.

But what problem does it really solve? When I am talking to customers, I see two kinds: the ones that have the actual problem and the ones that don't. You can easily recognize them, because the first category immediately says: "We need this!"

So let's look at the problem first, and then explain how Backup from Storage Snapshots (BfSS) works.

With the introduction of virtualisation, there are actually more layers that have to be made consistent before you can make a backup. In the old days, it was just the application, the operating system and then the hardware (SAN) underneath that. Now a new layer has been introduced: the virtualisation layer itself.

Since Veeam backs up at the VM level, it makes sense to take this layer into account. The way Veeam does it is by taking VM snapshots (for VMware). To make everything consistent in the VM, there are a couple of possibilities:
  • Use Veeam Application-Aware Image Processing: basically, talk to VSS directly via a runtime component. If necessary, it can also truncate logs for Exchange or SQL.
  • Use VMware Tools: for Windows, it will also do a (filesystem-level) integration with VSS. For other platforms (or if you prefer), you can use pre-snapshot/post-snapshot scripts.
Once everything is consistent in the VM, Veeam triggers a VMware snapshot. When that snapshot is created, everything can be released in the guest because you have a "consistent photo" of your VM. But what happens underneath?


Before the snapshot is created, the VM is happily reading from and writing to the VMDK.


After a snapshot has been created, VMware will create a delta disk. This disk will be very small in the beginning. However, while the snapshot (and thus the delta disk) exists, writes are redirected to this delta disk. The great advantage is that the original VMDK now only serves reads, for blocks that have not been overwritten. This means we can back up the original VMDK knowing that it is in a consistent state and won't be altered during the backup.

Important: VMware snapshots are not "transaction logs". If a block is updated for a second time, the existing block in the delta disk is updated in place, thus not taking extra space. That means the delta can maximally grow to the size of the original VMDK.

Well so far, so good. But what is the problem with this?


If you have a not-so-I/O-active VM, there is not really a problem. Because of the Changed Block Tracking feature, Veeam only has to back up the blocks that have changed between backups. That means fast backups, and due to the low I/O, the delta won't grow fast either.

But what if you have an I/O-active VM? Well, then you have a couple of problems. First of all, your snapshots will grow in 16MB extents (or at least that is what I could find on the net). But every time a snapshot grows, it needs to lock the clustered volume (VMFS) to allocate more space for the VMDK (metadata updates). That means extra I/O is needed, but also possible impact on other VMs that run on the same volume due to these locks. This problem also occurs with thin provisioning.

Secondly, if you are using thin provisioned VMFS volumes, the VMFS volumes will consume more and more space on the SAN. When you delete the snapshot, that space won't be automatically reclaimed. VMware now supports the UNMAP VAAI primitive, but as far as I know, it is not an automatic process:
http://cormachogan.com/2013/11/27/vsphere-5-5-storage-enhancements-part-4-unmap/
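For reference, kicking off a manual UNMAP on vSphere 5.5 looks roughly like this from PowerCLI (host and datastore names are placeholders; see Cormac's article above for the details):
# Rough sketch: manual space reclamation on a VMFS volume (vSphere 5.5).
$esxcli = Get-EsxCli -VMHost "esx01.lab.local"
# Arguments: reclaim unit (in blocks), volume label, volume UUID.
$esxcli.storage.vmfs.unmap(200, "Datastore01", $null)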

Finally, because it is an I/O-active VM, it has probably changed a lot of blocks between backups, meaning that the VM backup might take a long time.

So if you can reduce the time the snapshot is active, the snapshot won't have the chance to grow that big. You might not avoid the problems completely, but at least the impact will be a lot smaller.

But it can get worse. What happens when you delete (commit) the snapshot? Of course your data is not just discarded, but needs to be re-applied to the original volume. However, writes are still being done to that snapshot, so you cannot just start merging. Because what happens to a block you are committing back and updating at the same time? Well, for that VMware uses a consolidation helper snapshot.


Basically, VMware creates a second snapshot. All writes are redirected to this helper. Then the data from the original snapshot can be committed back to the original VMDK.


Once that is done, the hope is of course that the consolidation helper snapshot is smaller than the original snapshot. So for example, if the backup took 4 hours but consolidating only took 10 minutes, the helper snapshot should be only a fraction of the original snapshot.

What is important to notice is that the bigger the snapshot, the more extra I/O will be generated during the commit. You need to read the blocks from the snapshot and then overwrite them in the original VMDK. That means that during a commit, you might notice a performance impact on the volume and thus on your original VM as well.

But what happens after that commit? You are left with the consolidation helper, so you need to commit that too. In 3.5, VMware just froze the VM (holding off all I/O) and committed the delta file to the VMDK (called a synchronous commit). That means you could have huge freeze times (stuns). At one point, VMware improved this process by creating additional helper snapshots and going through the same process over and over again until it feels confident that it can commit the snapshot in a small amount of time.

There are actually 2 parameters that impact this process.
  • snapshot.maxIterations: how many times VMware will repeat this process of creating helper snapshots and committing them. After all iterations are over, the VM will be stunned anyway and the commit will be forced. By default, VMware goes through 10 iterations max.
  • snapshot.maxConsolidateTime: the estimated time VMware may stun your VM. The default is 6 seconds. For example, if after 3 iterations VMware is confident it can commit the blocks of the helper snapshot in less than 6 seconds, it will freeze all I/O (stun), commit the snapshot, continue I/O (unstun) and not go through any additional iterations.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039754
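If you ever need to change them, a PowerCLI sketch would look like this (the VM name and value are examples only; as mentioned below, only touch these together with support):
# Example only: lower the estimated stun window for one VM via PowerCLI.
Get-VM "ExchangeDAG01" |
    New-AdvancedSetting -Name "snapshot.maxConsolidateTime" -Value 1 -Confirm:$false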

So if you are running an I/O-intensive program, the impact might be huge if you have to go through several iterations. Also, imagine that instead of getting smaller consolidation helpers, you get bigger helpers after several iterations: the stun time might become huge instead of smaller. In the KB article there is an example where, starting with a 5-minute stun time, you might actually end up with a 30-minute stun time.

As a side note, I have to thank my colleague Andreas for pointing us to these parameters. While they were undocumented back then, they helped me find the info I needed. His article describes the process of lowering the max consolidate time for an Exchange DAG cluster. Granted, VMware might go through additional iterations, but the result might be that the stun time is smaller, thus not causing any failover. As he suggests as well, only do this together with support. If your I/O is too high, you might actually amplify the problem as described above.

The conclusion is that if you keep the delta file small, the commit will be much faster, will go through a lot fewer iterations, and the stun time might be minimized (even if you go through max iterations).

So how does BfSS help then? Well, when you use BfSS, a storage snapshot is created right after the snapshot on the VM level is created. That means you can then instantly delete the snapshot on the VM level.


So as you can see the start is the same.
 

But then you create a snapshot by talking to the SAN/NAS device that is hosting the VMFS volume / NFS share. This means your VM snapshot is "included" in the SAN snapshot, and this allows you to instantly commit the snapshot on the VM level.


Afterwards, the Veeam Backup & Replication proxy can read the data directly via the storage controller. Granted, Veeam will still create a snapshot, but you can imagine that a delta of 2 minutes will be 100x smaller than a delta of 3 hours.

Sometimes customers ask me if you are not just shifting the problem. From a thin provisioning perspective, of course not, because the SAN is aware of the blocks it deletes. From a performance perspective, SAN arrays are designed to do this. In fact, snapshots are NetApp's bread and butter. They just redirect pointers, so deleting a snapshot is just throwing away the pointers. So no nasty commit times there.

But there is another bonus with storage snapshots that will be exclusively available for NetApp. VMware has still not solved the stun problem that you can have with VMs hosted on NFS volumes when using Hot-Add backup mode. Backup & Replication has a way around this, but it still requires you to deploy a proxy on each host.

With BfSS, v8 will also implement an NFS client in the proxy component for NetApp. That means, even though you use NFS, you can use a "Direct SAN" approach (or as I like to call it, Direct NAS). First of all, it means you won't have those nasty stuns, but more importantly, you will read the data where it resides. That means no extra overhead on the ESXi side (no CPU/MEM needed!) when you are running your backups.

So although demoing this feature might not look impressive (unless you have this problem, of course), you can see that it is a major feature that has been added to Veeam Backup & Replication. The impact of making backups of I/O-intensive VMs will be drastically lower, allowing you to do more backups and thus yielding a better RPO.

*Edit* I also found that VMware has added a new parameter in one of its patches, but what snapshot.asyncConsolidate.forceSync = "FALSE" does is not described.