Evergreen Storage? Can it actually work?

Pure Storage is one of several hot flash vendors in the market right now. Despite some negativity about their recent IPO, the fact that they got there at all suggests the market thinks they have their product and execution right.

One challenge for every flash vendor out there (and there are quite a few) is to be able to explain the why. Why my product and not another vendor's?

One thing Pure Storage promote as a strong 'why us' is their concept of Evergreen Storage, described here:


Fundamentally they are saying that as technology evolves, their modular physical design and stateless software design will allow you to upgrade components without having to move data or do any of those forklift upgrades. Here is an image from their brochure:


Even with Storage vMotion, the need to move data between storage arrays remains a major additional cost of replacing or upgrading storage hardware, and the ability to minimise or eliminate this work is definitely a huge plus.

But can they actually do it?  Do we have working examples of other vendors achieving this?

There is actually a good working model of a product that has done exactly this since 2003: the IBM SAN Volume Controller. When IBM released the SVC in 2003, the first model (the 4F2) had only 4 GB of RAM per node and 2 Gbps FC adapters. Since then, IBM have released a succession of new models as Intel hardware has evolved, with the current nodes having at least 32 GB of RAM, dramatically more cores and optional 16 Gbps FC adapters!

The neat thing is that clients who invested in licensing in 2003 have been able to upgrade their nodes, with data in place, over successive years. The cost of new nodes has been relatively low compared to the performance and functional benefits each release has provided. So I know for a fact that this idea of an Evergreen storage product is not only possible, but has been positively demonstrated by IBM.

The challenge for any vendor trying to do this is threefold:

  1. The technology really has to support seamless upgrades. While the IBM SVC certainly did and does, there were some minor hiccups along the way. One example was that the first model, the 4F2, could not support the later 64-bit firmware releases, which meant that if you held off upgrading for too long, moving to new hardware needed some special help or a double hop. Another example is bad racking: if nodes were racked and stacked badly, pulling one node out could disturb its partner node (something I have sadly seen).
  2. The vendor needs to remain committed to the product. While I laud IBM's success with the SVC (now going even stronger alongside its Storwize brothers), a sister product released at the same time, the Storage File System (sometimes called Storage Tank), did not get market traction and did not progress very far before being replaced by GPFS (which was not exactly a one-for-one replacement). And while the DS8000 continues going strong (long after Chuck Hollis, in a classic piece of EMC FUD, declared it dead), its little sister, the DS6800, truly was dead within months of being released. Its early months were so drama-laden (it was sometimes sadly referred to as a crit-sit in a box) that new models were never released, which was equally sad, as once the code stabilised it became a great product.
  3. The vendor needs to hang around. This one seems fairly obvious. Clearly if someone were to buy Pure Storage (if the structure of the company allowed that), the new owner would also need to support this strategy.

So can Pure Storage do it?   Only time will tell, but they have made a great start and the industry has shown the concept is possible.   I will watch their progress with great interest!


Posted in Uncategorized | 3 Comments

vSphere ESXi 6.0 CBT (VADP) bug that affects incremental backups / snapshots

VMware recently posted a new KB article, 2136854, describing a newly discovered issue in their Changed Block Tracking (CBT) code.

It's important to note that this is not the same issue as the one also reported recently for ESXi 6.0 (KB 2114076), which is now fixed in a re-issued build of ESXi 6.0 (Build 2715440).

But it is very similar to KB 2090639 from a historical perspective.

The Issue

If you are leveraging a product that uses VMware's VADP for backup, then chances are you are leveraging it not just for initial fulls, but for regular incremental backups (via snapshots) as well. There are numerous products on the market that leverage this API; it's virtually the industry standard, as it results in much faster backups.

When the incremental changes are requested through the API (QueryChangedDiskAreas), it returns a list of the blocks that have changed since the last backup. Unfortunately, some changed blocks are not being reported at all, so backup data is essentially missing, and backups based on this data can be inconsistent when recovered, resulting in all sorts of problems.
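As background, CBT is switched on per virtual machine via flags in its .vmx file, so a quick way to confirm that a given VM is actually using CBT (and is therefore potentially exposed) is to check for those flags. A minimal sketch from the ESXi shell, with the datastore path and VM name as placeholders:

# Show the CBT flags: ctkEnabled for the VM, plus scsiX:Y.ctkEnabled per disk.
grep -i ctkEnabled /vmfs/volumes/datastore1/myvm/myvm.vmx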

The Challenge

Currently there is no resolution or hotfix for the issue from VMware. I hope we will see something in the coming days, given the wide-ranging impact on customers and the number of partner products affected.

The Workarounds

The KB suggests the following workarounds:

  1. Take a full backup every time. That will certainly work, but it's not really a viable fix for most customers (ouch!)
  2. Downgrade to ESXi 5.5 and take the virtual hardware back to version 10 (ouch!)
  3. Shut down the VM before doing an incremental (ouch!)

From the testing we have done at Actifio, option 3 doesn't actually provide a workaround either, and options 1 and 2 aren't really ideal.

The Discovery

When Actifio Customer Success Engineers discovered the issue, we contacted VMware and proved the problem using nothing but API calls, to demonstrate exactly where it was. How did we discover the issue, I hear you ask? We found it via our patented fingerprinting feature, which runs after every backup job.

This feature has essentially learnt not to trust the data we receive (history has proven its worth many times) but to verify it against our copy and the original source. If we find a variance of any kind, we trigger an immediate full read-compare against the source and update our copy. This works like a full backup job, but doesn't write out a complete copy again; it just brings our copy back in line with the source (as we like to save disk where we can!). We've seen this occur from time to time with our many different capture techniques (not just VADP), so it's a worthy bit of code, to say the least, that our customers benefit from.
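In shell terms, the principle looks something like this crude sketch (purely illustrative, and nothing like the actual Actifio implementation):

# Illustrative only: distrust the data path by comparing a checksum of our copy
# against the source, and trigger a full read-compare on any variance.
src_sum=$(md5sum /mnt/source/disk.img | cut -d' ' -f1)
copy_sum=$(md5sum /mnt/copy/disk.img | cut -d' ' -f1)
if [ "$src_sum" != "$copy_sum" ]; then
    echo "Variance detected: running full read-compare and updating our copy"
fi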

Let's hope there's a hotfix on the near horizon, so the many VADP/CBT vendor products that rely on it can get back to doing what we do best: protecting critical data for our customers in a way that can be recovered without question.


Thanks to Jeff O’Connor for writing this up.   You can find his blog here:  http://copydata.tips

Posted in Actifio, vmware | 2 Comments

Accessing the Instrumentation

Here are some rather bad photos of my 1972 Holden HQ Kingswood Premier, one of my first ever cars (and one that I sadly no longer own):


This was the V8 four-litre model (actually 253 cubic inches, often jokingly described as having all the power of a six-cylinder with the fuel economy of an eight). The engine bay was so huge and empty that I could open the bonnet and sit on the side of the car with my feet comfortably inside the bay while I changed spark plugs or cleaned the points.

The Kingswood was not what you would call an instrumented vehicle. The dashboard had a speedo, a fuel gauge and three lights: temperature, oil and charging. I dubbed these the idiot lights: once they came on, you were the idiot. (Sorry, no picture; this was the 1980s.)

Modern storage infrastructure, by comparison, is slightly more instrumented. A vast array of metrics is tracked, and these can be used to perform all sorts of analysis. Analysis like:

  • Are my hosts getting good response times?
  • Are specific disks or arrays being overworked?
  • Are my fibre ports being used in a balanced fashion?

So can you do this with the Storwize products?    Of course!   I documented the built-in tools here (where I talked about the Performance GUI):


And here (where I talked about the performance CLI):


But these tools have only limited usefulness. They are not granular: you cannot look at specific hosts, specific arrays or specific FC ports (meaning the three analysis ideas I suggested above are not even possible). So how can we do this analysis?

The good news is that Storwize products do track all the metrics needed for very granular analysis, and these are freely accessible. The files are documented by IBM; here is a fairly old page that describes some of them:


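If you want to pull the raw files yourself, the basic flow is to turn on statistics collection and then copy the files off the configuration node. A minimal sketch (exact commands and options vary by code level, so treat this as a guide rather than gospel):

# Start statistics collection at a 5 minute interval (run from the cluster CLI).
svctask startstats -interval 5

# List the collected files (they accumulate under /dumps/iostats on each node).
lsdumps -prefix /dumps/iostats

# From a workstation, copy the files off for analysis (admin@cluster is a placeholder).
scp admin@cluster:/dumps/iostats/* ./stats/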
But how do you turn these files into something useful? There used to be a tool called svcmon, but it appears to have been killed off, as per this rather sad blog post:


There is another IBM community-developed tool called qperf, which you can access using the link below:


With a graphing tool here:


And another tool here:  http://www.stor2rrd.com/

And yet another one here!   https://code.google.com/p/svc-perf/

The challenge with many of these tools is that they require manual setup, usually rely on a limited database engine, and don't always make analysis easy or simple.

You can of course use IBM’s TPC:


You could also consider IntelliMagic. Although I have not looked too deeply at this one, these guys wrote IBM's Disk Magic tool, so they certainly understand storage performance:


The challenge for all storage admins is that they are not always experts at diagnosing performance issues. Getting some genuine examples of the thinking process, and of how to get from problem to solution, is vital. This makes BVQ another good choice.

To see an example of how instrumented data presented in a graphical format can be used to generate a useful problem analysis, check out this blog post here:


and another one here:


I really like these posts for two reasons:

  1. They clearly show just how instrumented the product is
  2. They clearly show how using this data in a graphical format can lead to good, quick root cause analysis.

Also have a look at some of these videos:


So how are you instrumenting your Storwize?
What do you find the easiest tool to use?

Posted in Uncategorized | 5 Comments

DevOps Culture lessons for all of us?

A colleague of mine recently pointed me at this fascinating document:


It seems everyone is talking about DevOps right now, and if we accept even half of what this report tells us, ignoring what it says would be an opportunity lost. In essence the report finds:

High-performing IT organizations experience 60 times fewer failures and recover from failure 168 times faster than their lower-performing peers. They also deploy 30 times more frequently with 200 times shorter lead times.

Can we believe these numbers? There is a risk that the measured results don't entirely equate to the benefits of DevOps adoption. After all, many of the early adopters in this space were in fast-growing or new-market sectors; they were probably going to grow rapidly regardless, simply because they were already innovating in new areas. What the numbers do clearly show, however, is that if you can achieve high rates of change with lower levels of risk, you can adapt faster to market needs and customer demands. And that is what provides competitive advantage.

A simple example of speed to market struck me literally days ago, after I upgraded my iPhone to iOS 9 and found that my bank's iPhone app kept crashing. They had missed the upgrade boat, so to speak, and took a week to catch up. In some ways I should be pleased it was only a week, but in today's economy, seven days is a lifetime.

However, the part of the report that really struck a chord with me was the section titled Why Culture Matters. EVERYONE in every company should read this section. Print out these tables. Tape them to the desk or wall. Bring them to meetings. Reflect loudly… what kind of manager do you have? What kind of culture is your management team engendering?


More importantly, are these strategies being followed?


On the same day, I read/listened to this:


The fundamental message is that providing employees with places to gather and chat informally can generate huge benefits. Can this even occur in companies that don't provide their workers with tea and coffee?

I finished my day reading the DevOpsGuys blog.   I loved this discussion of technical debt (and the blog from Box referenced in the comments).   How many organizations out there are burdened by technical debt?

Don’t know what I mean?   Read the blog:



Posted in Uncategorized | Leave a comment

Do not install ESXi 5.5 Update 3 if you rely on VMware snapshots

ESXi 5.5 Update 3 was released on September 16, 2015. Since then it has emerged that after upgrading an ESXi host to this update, a snapshot consolidation task can cause the affected virtual machine to suffer an outage.

This disruptive issue occurs due to a segmentation fault when changing the snapshot tree data structure.

More details are here:

Snapshot consolidation causes virtual machines running on VMware ESXi 5.5 Update 3 hosts to fail with the error: Unexpected signal: 11 (2133118)

Clearly you should not install this update if your data protection software relies on VMware snapshots. If you have already installed it, consult the VMware link above for a workaround strategy, or suspend your snapshot scheduler (which you may need to do from your data protection software) while we wait for a fix from VMware.
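If you are not sure which build your hosts are running, you can check quickly from the ESXi shell (or over SSH) and compare the build number against the KB:

# Report the ESXi version and build number for this host.
esxcli system version get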


Posted in vmware | Tagged | 1 Comment

Making sense of the IBM SVC/Storwize Code release cycle

UPDATE 14 August 2015
When I initially posted this blog, there was a major error in my base spreadsheet that made the time periods shorter than they actually were (it was counting only weekdays, 5 days per week rather than 7, which was my mistake).
I withdrew the post, corrected it, and this is the edited update. The conclusions remain much the same, but the time periods are larger than initially stated.

A common question I get is this:

A new version of code has just been released for my SVC or Storwize product, should I upgrade or should I wait?

The challenge for many customers is that these upgrades:

  • Need change windows
  • Cannot be backed out
  • Rely on redundancy to avoid downtime
  • Take over an hour to complete

So when is the right point to get the most reliability and access to new features and hardware support, with the fewest change windows? It occurred to me only recently that rather than relying on a bunch of potentially subjective, gut-feel decision points, I could just use maths here. Go all Freakonomics on the subject.

Now it turns out IBM actually make this fairly easy, as they publish the build dates for the entire release history right here:


They are usually in a format like 115.51.1507081154000, where the 1507081154000 can be read as 11:54 on the 8th of July 2015. Now I don't work for IBM engineering, but my take is that if they publish a build date, this is when the code was built from source. Normally it is then fed to QA, and if it passes their sometimes real-world tests (I am only being slightly sarcastic), it hits the field release process. So the build date is not when the code was released, but when it was built.
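For what it's worth, the timestamp portion is easy enough to pull apart with a few lines of shell. A sketch based on my reading of the format (YYMMDDHHMM plus trailing zeros), not on any official IBM definition:

# Decode the build timestamp from a version string like 115.51.1507081154000.
build="115.51.1507081154000"
ts="${build##*.}"                # keep the portion after the last dot
yy=${ts:0:2}; mm=${ts:2:2}; dd=${ts:4:2}
hh=${ts:6:2}; mi=${ts:8:2}
echo "Built 20${yy}-${mm}-${dd} at ${hh}:${mi}"    # Built 2015-07-08 at 11:54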

So I took all of these dates, put them into a spreadsheet and calculated the time periods between builds (which I will call release dates, knowing they were NOT the actual dates of release).
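If spreadsheets aren't your thing, the same gap calculation is only a few lines of shell. A sketch, assuming GNU date and a file build_dates.txt of dates (one YYYY-MM-DD per line, oldest first):

# Print the number of days between each build date and the one before it.
prev=""
while read d; do
  s=$(date -d "$d" +%s)                 # convert the date to epoch seconds
  [ -n "$prev" ] && echo "$d: $(( (s - prev) / 86400 )) days since the previous build"
  prev=$s
done < build_dates.txt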

I considered three periods:

  • The time period between major releases (i.e. the days between 7.3 and 7.4). I made an executive decision to treat releases 4.1.0 and 4.1.1 as major even though they are probably not; you will see why as we go through. This metric shows how often major releases come out.
  • The days between updates within a release (for instance, the number of days between one update and the next within the same release). This metric shows how often patches come out.
  • The days between the build date of each update and the build date of its parent major release. This metric shows the patch release lifecycle of each release.

I then graphed each metric to get a visual impression of these. Let's check them out…

Time period between major releases

This one is interesting, as the trend is clear: a new major release comes out roughly every 180 days. This shows that a six-monthly release cycle is definitely being worked to.


However, there is a large glitch in the centre, which makes you wonder whether there were delays in certain releases. We know that release 5.0 had 64-bit changes and release 6.0 brought the Storwize platform into play, so that helps explain the spike.


Note that the very first release doesn't appear, as there was no release before it!

Days between updates within a release

This one is quite interesting. If you go to a release, how many days will pass before another build hits the field? In other words: I just upgraded… how many days will pass before I may potentially have to upgrade again (presuming I am determined to always run the latest release)?

The short answer, based on the trend line Excel added, is that the interval is becoming shorter as time goes by: it started at around 70 days and is now closer to 45. However, that is using a linear trendline; a logarithmic trendline is much flatter and sits almost constantly at 50 days.


Days between updates from the build of that major release

So this is the most interesting graph of all. Once a major release comes out, a series of patch releases follows for that release. How quickly do they get released? If the first five updates come out in the first 60 days and then there is a 30-day gap, does that mean I should wait 90 days?


This endless series of hills shows the way release histories run. At the start of each cycle there are lots of releases, each one arriving after a slightly longer interval than the one before. There is usually also a very late update, probably a roll-up for the slow upgraders (hence the sudden spike at the top of each hill). It also shows that the effective lifespan of a code release is around 500-550 days, as new updates simply stop appearing after a certain point. The more recent code levels taper off below those peaks, as they are not that old yet.

This leaves the one killer question: how long should I wait? I looked at just releases 7.1 to 7.5 to see if I could work out, purely from the release cycle, when to jump. The red line shows the days between each release, while the blue line shows the cumulative days within the release. From a user perspective, the higher the red line gets the better, as this means less code churn. So I look for the first major red peak and then find the matching point on the blue graph to see when that occurred.


I was looking for the point where the red line gets over 60 days and stays there, which typically occurs between 120 and 250 days after a release. I think waiting for five months is a very long interval; it really depends on how conservative you want to be. I do note that release 7.4 has the best correlation between the red and blue graphs, which is a very good sign.

So what can we learn from all of this?

I certainly learnt several things:

  • Major release cycles are six-monthly
  • Patch updates taper off as a release ages and eventually stop after around 18 months
  • Patch updates come out on average every 50 days
  • It takes between 150 and 250 days before the patch release intervals within each release start to slow down

What I need to do next is look at the incidence of HIPER releases (ones flagged as having severe impact) to see if we can add any extra metrics to this study.

For reference, attached is the spreadsheet I built with the IBM release dates.

Posted in IBM, IBM Storage, Storwize V3700, Storwize V7000, SVC, Uncategorized | 6 Comments

Unix tools for the SVC and Storwize masses: a 12 year journey

So it turns out the only thing harder than writing blog posts is keeping up with other people’s blog posts  (actually it’s easy to write blog posts….  it’s just hard to write good and useful ones).   So when I ranted last week about the lack of Unix tools in the SVC/Storwize rbash shell, it turns out I had missed the memo…  they have finally arrived!    Barry Whyte revealed all here:


It has been an interesting journey to watch the SVC code base mature and expand its horizons as it has gone from a straight storage virtualization platform to a storage controller platform as well. In the process, the effective user base for this platform has increased dramatically, I suspect well in excess of tenfold. And with that increase comes much more pressure on usability. In the CLI space, the first big thing to change was that IBM dropped the requirement to use private/public keys to log in to the CLI shell. The fact of the matter is that many storage admins are not Unix experts… they don't use PuTTY in their daily lives, and running PuTTYgen to generate keys felt positively strange to them. They just wanted to type a username and password and be logged in… and that is precisely what they got.

The need to preface info- and task-type commands with svcinfo and svctask also went by the wayside, which made things simpler as well (although I usually leave the prefixes in sample scripts for maximum backwards compatibility).
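For example, on current code these two commands do exactly the same thing; the prefixed form is simply the one that also works on very old code:

svcinfo lsvdisk -delim ,
lsvdisk -delim ,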

But in the opposite direction was the restricted bash shell that the CLI user gets to live in. Unix users (particularly those power users who can bash out a complex awk command in 40 keystrokes) were all a little stunned that they had to run all the cool commands outside the SSH session. Telling them it has been that way for 12 years didn't make them feel any better.

So with the 7.5 release (which is still fairly steaming new), you get 11 Unix shell commands that are all very very sweet:


So for a simple example, if I want to grab just the VDisk IDs for all VDisks, I would normally use a command like this one:  lsvdisk -delim , -nohdr | cut -d"," -f1

Run on a pre 7.5 machine I get:

IBM_Storwize:anthonyv>  lsvdisk -delim , -nohdr | cut -d"," -f1
rbash: cut: command not found

On a 7.5 machine I get:

IBM_Storwize:anthonyv>lsvdisk -delim , -nohdr | cut -d"," -f1

What's nice is that a Windows admin firing commands with plink can now have Unix commands run on the remote side, using just a plain plink command on the local side.
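For instance, something like this (hostname and credentials are placeholders) runs the whole pipeline, cut included, at the Storwize end:

plink admin@cluster "lsvdisk -delim , -nohdr | cut -d, -f1"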

My only complaint is the lack of floating-point support. Bash by default does not handle floating point. To prove this, here is an amusing example where we get bash to perform some division:

IBM_Storwize:anthonyv>echo "4 divided by 2 equals $(( 4/2 ))"
4 divided by 2 equals 2

IBM_Storwize:anthonyv>echo "2 divided by 4 equals $(( 2/4 ))"
2 divided by 4 equals 0

I normally use awk to handle floating-point calculations. So if you were to sum the sizes in bytes of a number of MDisks and then divide by 1024 three times to get GiB, you might want a result to, say, three decimal places; but with the bash built-ins and the tools so far exposed to the rbash user, you cannot do this. The bc command can also handle floating point, so next time you see your IBM rep, say 'nice try, but you missed floating point'.
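Until that happens, the workaround is to do the arithmetic at the workstation end. A sketch of the MDisk example above (the capacity field position is my assumption and may vary by code level, so check your own lsmdisk output):

# Sum MDisk capacities in bytes and print the total in GiB to three decimal places.
# Field 7 is assumed to be capacity; admin@cluster is a placeholder.
ssh admin@cluster 'lsmdisk -bytes -delim , -nohdr' |
  awk -F, '{ sum += $7 } END { printf "Total: %.3f GiB\n", sum / (1024*1024*1024) }'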

Let me know what they say.    #;-)

Posted in IBM, IBM Storage, Storwize V3700, Storwize V7000, SVC | 1 Comment