vSphere ESXi6.0 CBT (VADP) bug that affects incremental backups / snapshots.

VMware recently posted a new KB article 2136854 to advertise a new issue that has been found with their Changed Block Tracking (CBT) code.

It’s important to note that this is not the same one as posted recently also for ESXi 6.0 (KB 2114076) – now fixed in a re-issued build of ESXi 6.0 (Build 2715440)

But it is very similar to KB 2090639 from a historical perspective.

The Issue

If you are leveraging a product that uses VMware’s VADP for backup, then chances are you are leveraging this for not just initial fulls, but regular incremental snapshots (for backup purposes). There are numerous products on the market that leverage this API, it’s virtually the industry standard to use this feature as it results in faster backups.

When the incremental changes are being requested through the API (QueryDiskChangedAreas) the API is requested changed blocks, but unfortunately some of the changed blocks aren’t being correctly reported in the first place, so backup data is essentially missing. And backups based on this can be inconsistent when recovered and result in all sorts of problems.

The Challenge

Currently there is no resolution or hotfix to the issue from VMware. I hope that we will see something in the coming days due to the wide ranging impact to customers and partner products affected.

The Workarounds

The workarounds in the KB suggests:

  1. Do a full backup for each backup, and that will certainly work, but it’s not really a viable fix for most customers (ouch !)
  2. Downgrade to ESX 5.5 and virtual hardware back to 10 (ouch !)
  3. Shutdown the VM before doing an incremental  (ouch !)

From the testing we have done at Actifio, option 3 doesn’t actually provide a workaround either, and options 1 & 2 aren’t really ideal.

The Discovery

When Actifio Customer Success Engineers discovered the issue, we contacted VMware and proved the problem leveraging just API calls to demonstrate where the problem was. How did we discover the issue I hear you ask?  Well we managed to discover the issue via our patented fingerprinting feature that occurs post every backup job. This technique (feature) essentially has learnt to not trust the data we receive (history has proven this feature to be useful many times) but to go and verify it against our copy and the original source copy. If we receive a variance in any way, we trigger an immediate full read compare against the source and update our copy. This works like a Full Backup job, but doesn’t write out a complete copy again, it just updates our copy to line up with the source again (as we like to save disk where we can!). We’ve seen this occur from time to time with our many different capture techniques (not just VADP), so it’s a worthy bit of code to say the least that our customers benefit from.

Let’s hope there’s a hotfix on the near horizon, so the many VADP / CBT vendor products that rely on it, can get back to doing what we do best and that’s protecting critical data for our customers that can be recovered without question.

—-

Thanks to Jeff O’Connor for writing this up.   You can find his blog here:  http://copydata.tips

Advertisements

About Anthony Vandewerdt

I am an IT Professional who lives and works in Melbourne Australia. This blog is totally my own work. It does not represent the views of any corporation. Constructive and useful comments are very very welcome.
This entry was posted in Actifio, vmware. Bookmark the permalink.

2 Responses to vSphere ESXi6.0 CBT (VADP) bug that affects incremental backups / snapshots.

  1. John says:

    Hi Tony, thanks for the update, the question has to be asked of the reliability of using CBT. The problem is that the likes of Actifio and the plethora of other backup products which rely on CBT for their product success all share what is becoming the weakest link in the chain. For VMware backup/restore of files and VMs is not their core focus unlike CBT users.
    This CBT reliability problem has reared its ugly head before and brings into the question of being able to reliable restore data with confidence.
    Are VMware able to offer reassurance of the CBT reliability.

  2. I totally understand your concerns. It also bothers me that the larger vendors didn’t pick up on this. Veeam, Networker, TSM, Netback. I think they also depend on CBT and yet didn’t discover this during their own QA.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s