2017/02/22

Under the hood: How does ReFS block cloning works

In the latest version of Veeam 9.5, there is a new feature called ReFS Block cloning integration. It seems that the ReFS Block cloning limitations confuse a lot of people so I decided to look a bit under the hood. It turns out that most limitations are just basic limitation of the API itself.

To understand better how it all works, I made a small project called Refs-fclone . The idea is to give it a source file (existing) and then to duplicate that file to a target file (non existing) with the API. Basically creating a synthetic full from an existing VBK. 

It turns out that idea was not so original. During my google quests for more information (because some parts didn't worked), it appeared that there was a fellow hacker that made the exact tool. I must admit that I reused some of his code so you can find his original code here

Nevertheless I finished he project, just to figure out for myself how it all works. In the end, the API is pretty "easy" I would say. I don't want to go over the complete code but highlight some important bits. If you don't understand the C++ code, just ignore it and read the text underneath it. I tried to put the important parts in bold.

Prerequirements

Before I even got started, my initial code did not want to compile. I couldn't figure it out because I had the correct references in place. But for some reason, it could not find "FSCTL_DUPLICATE_EXTENTS_TO_FILE". So I start looking into my projects settings. Turned out, it was set to compile with Windows 8.1 as a target and when I changed it to 10.0.10586.0, all of the sudden it could find all reference. 

This shows an important lesson. This code is not meant to be ran on Windows 2012 because it just doesn't have the API call supported. So many customers have been asking, will the ReFS integration work on Windows 2012 and the answer is simple: NO. At the time it was developed, the API call didn't exist. Also, you will need to have the underlying volume formatted with Windows 2016 because again, the ReFS version in 2012 did not support this API call.


So let's look at the code. First, before you clone blocks, there are some requirements which I want to highlight in the code itself:
FILE_END_OF_FILE_INFO preallocsz = { filesz };
SetFileInformationByHandle(tgthandle, FileEndOfFileInfo, &preallocsz, sizeof(preallocsz));
This bit of code defines the end of the file. Basically it tells windows how big the file should be. In this case, the size if filesz which is the original file size. Why is that important? Well to use the block clone API, we need to tell it where it should copy it data to. Basically a starting point + how much data we want to copy. But this starting point has to exist, so if want to make a complete copy, we have to resize it to be a big as the original

if (filebasicinfo.FileAttributes | FILE_ATTRIBUTE_SPARSE_FILE) {
FILE_SET_SPARSE_BUFFER sparse = { true };
DeviceIoControl(tgthandle, FSCTL_SET_SPARSE, &sparse, sizeof(sparse), NULL, 0, dummyptr, NULL);
}
Next bit is the sparse part. The "if" statements basically check if the source file is a sparse file, and if it is, we should make the target sparse (tgthandle) as well. So what it is a sparse file? Well basically if a file is not a sparse file, it will allocate all the data on disks if you resize it. Even if you didn't write anything to it yet. A sparse file only allocates space, when you write non zero data somewhere. So even if it looks like it is 15GB big, it might only consume 100MB on disk but space is not really allocated. 

Why is that important? Well again, the API requires that source and target files need to have the same setting. This code actually runs before the resizing part. The reason is simple, if you do not make it sparse, the file will allocate all the space on disk, even if we didn't write to it. Not a great way to make space-less fulls.

if (DeviceIoControl(srchandle, FSCTL_GET_INTEGRITY_INFORMATION, NULL, 0, &integinfo, sizeof(integinfo), &written, NULL)) {
DeviceIoControl(tgthandle, FSCTL_SET_INTEGRITY_INFORMATION, &integinfo, sizeof(integinfo), NULL, 0, dummyptr, NULL);
}
Finally this bit. Basically it get the integrity stream information from the source file and then copies it to the target file.  Again, they have to be the same for the code to allow for block cloning.

This shows that basically the source and target file have to be pretty much the same.  This partially explains why you need to have an Active Full on your chain before block cloning starts  to work. The old backup files might not have been created with ReFS in mind!

Also for integrity streams to work, we don't need to do anything fancy. We just need to tell ReFS, this file should be checked. 


The Cool Part

for (LONGLONG cpoffset = 0; cpoffset < filesz.QuadPart; cpoffset += CLONESZ) {
LONGLONG cpblocks = CLONESZ;
if ((cpoffset + cpblocks) > filesz.QuadPart) {
cpblocks = filesz.QuadPart - cpoffset;
}
DUPLICATE_EXTENTS_DATA clonestruct = { srchandle };
clonestruct.FileHandle = srchandle;
clonestruct.ByteCount.QuadPart = cpblocks;
clonestruct.SourceFileOffset.QuadPart = cpoffset;
clonestruct.TargetFileOffset.QuadPart = cpoffset;
DeviceIoControl(tgthandle, FSCTL_DUPLICATE_EXTENTS_TO_FILE, &clonestruct, sizeof(clonestruct), NULL, 0, dummyptr, NULL);
}
That's it. That is all what is required to do the real cloning. So how does it work. Well first there is a for loop that goes over all the chunks of data of the source files. There is one limitation with the block clone API. You can only copy a chunk of 4GB at a time. In this project CLONESZ is defined as 1GB to be on the safe side.

So imagine you have a file of 3.5GB. The for loop will calculates that the first chunk starts a 0 bytes and the amount of data we want to copy is 1GB. Next time, it will calculate the the next chunck starts at 1GB and we need to copy 1 GB, and so on..

However the forth time, it actually detects that there is only 500GB remaining, and instead of copying 1GB, we copy only what is remaining (filesize - where we are now).

But how do we call the API? well first we need to create a struct (think of it is as a set of variables). The first variable references the original file. Bytecount says how much data we want to copy (mostly 1GB). Finally the source and target file offset are filed in with the correct starting point. Since we want duplicates, the starting point for the block clone is the same.

Finally we just tell windows to execute the "FSCTL_DUPLICATE_EXTENTS_TO_FILE" on the target file, which basically invokes the API. We give it the set of variables we filled in correctly. So basically the clone API itself + filling in the variables is only 5 lines of code.

The important bit here, is that you can not just copy files on a ReFS volume and expect ReFS to do the block cloning. An application really has to tell ReFS to clone data from one file to the other and both files have to be on the same disk.

This has one advantage though. The API just clones data even if Veeam has compressed that data or encrypted it. Since Veeam actively tells ReFS to clone the data, it doesn't have to figure out what data is duplicate, it just does the job. That is a major advantage against deduplication: you can still secure and compress your files. Also, since the clone is just a simple call during the backup, it doesn't require any post processing. And no post processing means no exuberant CPU usage or extra I/O to execute the call.

Seeing it action

This is how E:\CP looks like before executing refs-fclone. Nothing special. An empty directory and the ReFS volume has 23GB free


Now lets copy a VBK to E:\CP with the tool. It shows that the source file is around 15GB big and it is cloning 1GB at the time. Interestingly enough you see that last run, it just copies the remainder of the data.


This run took around 5 seconds max to execute this "copy". Seems like nothing really happened. However, if we check the result on disk, we see something interesting: 


The free disk space is still 23GB. However we can see that a new file is created that is 15GB+. Checksumming both files give exactly the same result.

Why is this result significant? Well it shows that the interface to the block clone API is pretty straight forward. It also means that although it looks like Veeam is cloning the data, it is actually ReFS that manages everything under the hood. From a Veeam perspective (and also end-user perspective), the end result looks exactly like a complete full on disk. So once the block clone API call is made, there is no way to undo it or to get statistics about it. All of the complexity is hidden.

Why do we need aligned blocks?

Finally I want to share you this result


In the beginning, I made a small file that had some random text like this. In this example, it has 10 letters in it, which means it is 10 Bytes on disk. When I tried the tool, it didn't work (as you can see), but the tool did work on Veeam backup files. 

So why doesn't it work. Well the clone API has another important limitation. Your clone regions must match a complete set of clusters. By default the cluster size is 4KB (although for Veeam it is strongly recommended to use 64KB to avoid some issues). So if I want to make a call, the starting point has to be a multiple of 4KB. Well 0 in a sense is a multiple of 4KB so that's OK. However the amount of bytes you want to copy, also has to be a multiple of 4KB, and 10B is clearly not. When I padded the file, to be exactly 4KB (so adding 4096 chars), everything worked again.

This show a very important limitation. For block cloning to work, the data has to be aligned, since you can not copy unaligned data. Veeam backup files are by default not aligned. Thus it is required to run an active full before the block clone API can be used. To give you a visual idea what this means. On top "a default" Veeam Backup file, at the bottom, an aligned file which is required for ReFS integration



Due to compression, data blocks are not always the same size. So to save space, they just can be appended after each other. However for the block clone API, we need align the blocks. The result is that we sometimes have to pad a sector with empty data. So why do we need to align, can we just not clone more data? After all it doesn't consume more space?

Well take for example the third block. Unaligned, it is cluster 2,3,4. Aligned it is only in cluster 3 and 4. So because of the aligned blocks, we have to clone less data. You might think, why does it matter because cloning does not take extra space? 

Well first of all it keeps the files more manageable without filling it with junk data. If you copy 2 and 4 from the unaligned file, you basically add data that is not required. Next, if you delete the original file, the data does start "using space on disk". Because of the reference, you basically tell ReFS not to delete the data blocks as long as they are referenced by some file. So the longer these chain continue, the more junk data you might have.

So this is the reason why you need an active full. A full backup has to be created with ReFS in mind, otherwise the blocks are not aligned and in this case Veeam refuses to use the API

If you want to read more about block size I do recommend this article from my colleague Luca

One more thing

Here is a fun idea. You could use the tool together with a post process script to create GFS point on a Primary Chain. Although not recommended, you could for example run a script every month that "clones" the last VBK to a separate folder. The clone is instant so doesn't take a lot of time or much extra space. You could script your own retention or manually delete files. Clearly this is not really supported but it would be a cool idea to keep for example one full VBK as a monthly full for a couple of years