Do not install ESXi 5.5 Update 3 if you rely on VMware snapshots

ESXi 5.5 Update 3 was released on September 16, 2015. Since then it has emerged that after upgrading an ESXi host to this update, a snapshot consolidation task can cause the virtual machine being consolidated to fail, resulting in an outage.

This disruptive issue occurs due to a segmentation fault when changing the snapshot tree data-structure.

More details are here:

Snapshot consolidation causes virtual machines running on VMware ESXi 5.5 Update 3 hosts to fail with the error: Unexpected signal: 11 (2133118)

Clearly you should not install this update if your data protection software relies on VMware snapshots. If you have already installed it, consult the VMware article above for a workaround, or suspend your snapshot scheduler (which you may need to do from your data protection software) while we wait for a fix from VMware.



Making sense of the IBM SVC/Storwize Code release cycle

UPDATE 14 August 2015
When I initially posted this blog, there was a major error in my base spreadsheet that made the time periods shorter than they actually were (it was counting only weekdays, 5 days per week rather than 7, which was my mistake).
I withdrew the post and corrected it, and this is an edited update. The conclusions remain pretty well the same, but the time periods are longer than initially stated.

A common question I get is this:

A new version of code has just been released for my SVC or Storwize product, should I upgrade or should I wait?

The challenge for many customers is that these upgrades:

  • Need change windows
  • Cannot be backed out
  • Rely on redundancy to avoid downtime
  • Take over an hour to complete

So when is the right point to get the most reliability and access to new features and hardware support, but with the least number of change windows? It occurred to me only recently that rather than rely on a bunch of potentially subjective or gut-feel decision points, I could just use maths here. Go all Freakonomics on this subject.

Now it turns out IBM actually made this fairly easy as they publish the build dates for the entire release history right here:

They are usually in a format like 115.51.1507081154000  where the 1507081154000 can be read as  11:54 on the 8th of July 2015.      Now I don’t work for IBM engineering, but my take is that if they publish a build date then I am fairly confident that this will be when the code was built from source.   Normally it is then fed to QA and if it passes their sometimes real world tests (I am only being slightly sarcastic), it hits the field release process.   So the build date is not when it was released, but when it was built.
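To make that reading of the build stamp concrete, here is a minimal shell sketch that decodes it, assuming the last dot-separated field always starts with YYMMDDHHMM (which matches the example above, but is my reading rather than anything IBM documents). The function name decode_build is just for illustration.

```shell
# Decode the trailing field of a build string into a readable date.
# Assumes the field starts with YYMMDDHHMM; any later digits are ignored.
decode_build() {
  printf '%s\n' "${1##*.}" | awk '{
    printf "20%s-%s-%s %s:%s\n",
      substr($0, 1, 2), substr($0, 3, 2), substr($0, 5, 2),
      substr($0, 7, 2), substr($0, 9, 2)
  }'
}

decode_build 115.51.1507081154000   # prints 2015-07-08 11:54
```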

So I took all of these dates and put them into a spreadsheet and then calculated time periods between builds (which I will call release dates, knowing they were NOT the actual date of release).

I considered three periods:

  • The time period between major releases (i.e. days between 7.3 and 7.4). I made an executive decision to treat releases 4.1.0 and 4.1.1 as major even though they are probably not. You will see why as we go through. This metric shows how often major releases come out.
  • Days between updates within a release (for instance how many days between and compared to how many days between and   This metric shows how often patches come out.
  • Days between the build date of each update and the build date of its major release.  For instance, days between and versus days between and  This metric shows the patch release lifecycle of each release.
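All three metrics boil down to counting calendar days between two build stamps. The sketch below shows one way to do that outside a spreadsheet, and it counts all 7 days of the week. It assumes the same YYMMDD prefix as before, converts each date to a Julian Day Number and subtracts. The function names are hypothetical, and the two stamps are simply ones that appear in these posts, not a pair of adjacent releases.

```shell
# Convert the YYMMDD prefix of a build stamp to a Julian Day Number.
build_jdn() {
  printf '%s\n' "${1##*.}" | awk '{
    y = 2000 + substr($0, 1, 2); m = substr($0, 3, 2) + 0; d = substr($0, 5, 2) + 0
    a = int((14 - m) / 12); yy = y + 4800 - a; mm = m + 12 * a - 3
    printf "%d\n", d + int((153 * mm + 2) / 5) + 365 * yy \
      + int(yy / 4) - int(yy / 100) + int(yy / 400) - 32045
  }'
}

# Calendar days from the first build to the second (weekends included).
days_between() {
  echo $(( $(build_jdn "$2") - $(build_jdn "$1") ))
}

days_between 97.5.1501190000 115.51.1507081154000   # prints 170
```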

I then graphed each metric to get a visual impression of these. Let’s check them out…

Time period between major releases

This one is interesting as the trend is clear: a new major release is coming out roughly every 180 days. This shows that a six-monthly release cycle is definitely being worked to.


However there is a large glitch in the center, which makes you wonder whether there were some delays in certain releases.    We know that Release 5.0 had 64 bit changes and Release 6.0 brought the Storwize platform into play, so that helps explain that spike.


Note that doesn’t appear as there was no release before that!

Days between updates within a release

This one is quite interesting.   If you go to a release, how many days will pass before another build hits the field?   In other words, I just upgraded… how many days will pass before I may potentially have to upgrade again (presuming I am determined to always run the latest release).

The short answer is that the trend line Excel added shows the interval becoming shorter as time goes by. It started at 70 days and is now closer to 45 days. However, that is with a linear trendline; a logarithmic trendline is much flatter and stays almost constantly at 50 days.


Days between updates from the build of that major release

So this is the most interesting graph of all. Once a major release comes out, a series of patch releases comes out for that release. How quickly do they get released? If the first five updates come out in the first 60 days and then there is a 30 day gap, does that mean I should wait 90 days?


This endless series of hills shows the way release histories run. At the start of each cycle there are lots of releases, each one coming out after a slightly longer period than the previous one. There is usually a very late update, probably a roll-up for the slow upgraders (thus the sudden spike to the top of each hill). It also shows the effective lifespan of a code release is around 500-550 days, as new updates simply don’t come out after a certain point. The more recent code levels taper off on those peaks, as they are not that old yet.

This leaves the one killer question… how long should I wait?    I looked at just release 7.1 to 7.5 to see if I could work out just from the release cycle, when to jump.   The red line shows the days between each release while the blue line shows the cumulative days within the release.   From a user perspective, the higher the red line gets the better, as this means less code churn.   So I look for the first major red peak and then find the matching point on the blue graph to see when that occurred.


I was looking for when the red line gets over 60 days and stays there, which typically occurs between 120 and 250 days after a release. I think waiting for 5 months is a very long interval; it really depends on how conservative you want to be. I do note that release 7.4 has the best correlation between the red and blue graphs, which is a very good sign.

So what can we learn from all of this?

I certainly learnt several things:

  • Major release cycles are six monthly
  • Patch updates taper off as the release ages and eventually stop after around 18 months
  • Patch updates are coming out  on average every 50 days
  • It takes between 150 and 250 days before new patch release intervals start to slow down within each build

What I need to do next is look at the incidence of hyper releases (ones that have severe impact) to see if we can add any extra metrics to this study.

For reference, attached is the spreadsheet I built with the IBM release dates.










Unix tools for the SVC and Storwize masses: a 12 year journey

So it turns out the only thing harder than writing blog posts is keeping up with other people’s blog posts  (actually it’s easy to write blog posts….  it’s just hard to write good and useful ones).   So when I ranted last week about the lack of Unix tools in the SVC/Storwize rbash shell, it turns out I had missed the memo…  they have finally arrived!    Barry Whyte revealed all here:

It has been an interesting journey to watch the SVC code base mature and expand its horizons as it has gone from a straight storage virtualization platform to a storage controller platform as well. In the process the effective user base for this platform has increased dramatically, I suspect well in excess of tenfold. And with that increase comes much more pressure on usability. In the CLI space, the first big thing to change was that IBM dropped the requirement to use private/public keys to log in to the CLI shell. The fact of the matter is that many storage admins are not UNIX experts… they don’t use PuTTY in their daily lives and running PuTTYgen to generate keys felt positively strange to them. They just wanted to type a username and password and be logged in… and that is precisely what they got.

The need to preface info and task type commands with svcinfo and svctask also went by the wayside, which made things simpler as well (although I usually leave them in sample scripts for the ultimate backwards compatibility).

But in the opposite direction was the restricted bash shell that the CLI user gets to live in. Unix users (particularly those power users who know how to bash out a complex awk command in 40 keystrokes), were all a little stunned that they had to run all the cool commands outside the SSH command.    Telling them it has been that way for 12 years didn’t make them feel any better.

So with the 7.5 release (which is still fairly steaming new), you get 11 Unix shell commands that are all very very sweet:


So for a simple example, if I want to grab just the VDisk IDs for all VDisks I would normally use a command like this one:   lsvdisk -delim , -nohdr | cut -d"," -f1

Run on a pre 7.5 machine I get:

IBM_Storwize:anthonyv>  lsvdisk -delim , -nohdr | cut -d"," -f1
rbash: cut: command not found

On a 7.5 machine I get:

IBM_Storwize:anthonyv>lsvdisk -delim , -nohdr | cut -d"," -f1

What’s nice is that a Windows admin firing commands with plink can now get unix commands run on the remote side with just a Windows plink command on the local side.

My only complaint is lack of floating point support.   Bash by default does not handle floating point.   To prove this, here is an amusing example where we get bash to perform some division:

IBM_Storwize:anthonyv>echo "4 divided by 2 equals $(( 4/2 ))"
4 divided by 2 equals 2

IBM_Storwize:anthonyv>echo "2 divided by 4 equals $(( 2/4 ))"
2 divided by 4 equals 0

I normally use awk to handle floating point calculations. If you were to sum the size in bytes of a number of MDisks and then divide by 1024 three times to get GiB, you may want a result to, say, three decimal places, but with bash built-ins and the tools so far exposed to the rbash user, you cannot do this. The bc command can also handle floating point, so next time you see your IBM rep, say ‘nice try but you missed floating point’.
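For comparison, this is roughly what the awk approach looks like when run host side (not inside the Storwize shell). It is a minimal sketch: the function name and the two byte counts are made up for illustration, and in practice the input would be a column of MDisk capacities in bytes.

```shell
# Sum byte counts (one per line on stdin) and print the total in GiB
# to three decimal places. 1 GiB = 1024 * 1024 * 1024 bytes.
sum_bytes_to_gib() {
  awk '{ total += $1 } END { printf "%.3f\n", total / (1024 * 1024 * 1024) }'
}

# Illustrative input: 1 GiB plus 0.5 GiB.
printf '%s\n' 1073741824 536870912 | sum_bytes_to_gib   # prints 1.500
```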

Let me know what they say.    #;-)


Some scripting hints for checking your Storwize firmware version

I am in the habit of writing mini shell scripts to paste into an SVC or Storwize terminal to create mini reports. While the GUI does make all this quite easy, I am quite often remote from the machine, or at the end of a VPN tunnel, so using the GUI is not always convenient.

So for instance, if you want to learn just the software version of your Storwize device and the firmware version of its drives, there is no super quick way to do that from the command line. You can use the lssystem command to get the cluster software version, but since there is no grep command on Storwize/SVC, you need to sort through all the output yourself host side or use some fancy bash tricks. You can use the lsdrive command to get the drive firmware version, but its summary view does not show the drive types or firmware versions. This is rather annoying as it means you need to run lsdrive against every drive to get that level of detail. In a perfect world I should be able to specify which fields I want in the summary view (see an example below of the rather sparse summary view):


I wrote a small script to display the firmware version of each drive.   It looks like this:

firmwareversion=$(svcinfo lssystem -delim , | while IFS="," read -ra data
do
if [ "${data[0]}" == "code_level" ]
then echo ${data[1]}
fi
done)
drive=$(printf "%5s%-20s%-10s%-15s \n" "ID" " DriveType" "Capacity" "Version"
svcinfo lsdrive -nohdr -delim , | while IFS="," read -ra drives
do
svcinfo lsdrive -delim , ${drives[0]} | { while IFS="," read desc data
do
[[ $desc == "id" ]] && id=$data
[[ $desc == "product_id" ]] && product_id=$data
[[ $desc == "capacity" ]] && capacity=$data
[[ $desc == "firmware_level" ]] && firmware_level=$data
done
printf "%5s%-20s%-10s%-15s \n" "$id" " $product_id" "$capacity" "$firmware_level"
}
done);echo "";echo "Version $firmwareversion";echo "";echo "$drive"

Now for those who understand shell scripts, it uses the -delim , option to separate fields (since in general, commas are not allowed to appear in any data fields).  It then reads the output into an array with the command read -ra, telling the read command that the field delimiter is a comma with the IFS="," statement.
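To see the IFS and read -ra combination in isolation, here is a minimal sketch you can run in bash on any host (not in the Storwize shell). The sample line is a shortened, illustrative lsdrive summary row rather than captured output, and the function name is made up.

```shell
# Split one comma-delimited row into a bash array and pick fields by index.
# Setting IFS only for the read command leaves the shell's IFS untouched.
show_fields() {
  local fields
  IFS="," read -ra fields <<< "$1"
  echo "${fields[0]} ${fields[4]} ${fields[5]}"
}

# The empty third column is preserved, so the indexes do not shift.
show_fields "0,online,,member,sas_nearline_hdd,2.7TB"   # prints 0 sas_nearline_hdd 2.7TB
```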

When I ran it on a Storwize V3700 running 7.3 code, it ran fine with output like this:

Version (build 97.5.1501190000)

ID  DriveType       Capacity  Version 
 0  HUS723020ALS64  1.8TB     J3K8 
 1  HUS723020ALS64  1.8TB     J3K8 
 2  HUS723020ALS64  1.8TB     J3K8 
 3  HUS723020ALS64  1.8TB     J3K8 
 4  HUS723020ALS64  1.8TB     J3K8 
 5  HUS723020ALS64  1.8TB     J3K8

But when I ran it on a V3700 running 7.4 or 7.5 code I got this:

rbash: IFS: readonly variable
rbash: IFS: readonly variable
CMMVC5709E [0,online,,member,sas_nearline_hdd,2.7TB,0,mdisk1,5,1,8,,,inactive] is not a supported parameter.

I googled the issue and found this rather helpful forum discussion:

The solution is two-fold:

  1. Don’t use IFS to define field separators.
  2. Don’t allow any names in your system to have spaces in them.  This normally doesn’t occur but it appears they may be allowing them in VDisk names.
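Both points come down to how read splits on whitespace when IFS is left alone. The sketch below runs in any POSIX shell on your own machine; the input lines are invented for illustration and are not real CLI output.

```shell
# Detailed view: "name value" pairs. The last variable soaks up the rest
# of the line, so two variables are enough.
echo "product_id HUS723020ALS64" | {
  read -r desc data
  echo "desc=$desc data=$data"      # prints desc=product_id data=HUS723020ALS64
}

# Summary view: a name containing a space pushes every later column
# one place to the right, which is why spaces in names are a problem.
echo "3 my volume online" | {
  read -r id name status
  echo "id=$id name=$name status=$status"   # prints id=3 name=my status=volume online
}
```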

So I re-wrote my script to not rely on IFS to separate fields and it now looks like this (and runs fine on release 7.4 and 7.5 machines):

firmwareversion=$(svcinfo lssystem | while read desc data
do
if [ "$desc" == "code_level" ]
then echo $data
fi
done)
drive=$(printf "%5s%-20s%-10s%-15s \n" "ID" " DriveType" "Capacity" "Version"
svcinfo lsdrive -nohdr | while read did status
do
svcinfo lsdrive $did | { while read desc data
do
[[ $desc == "id" ]] && id=$data
[[ $desc == "product_id" ]] && product_id=$data
[[ $desc == "firmware_level" ]] && firmware_level=$data
[[ $desc == "capacity" ]] && capacity=$data
done
printf "%5s%-20s%-10s%-15s \n" "$id" " $product_id" "$capacity" "$firmware_level"
}
done);echo "";echo "Version $firmwareversion";echo "";echo "$drive"

The other suggestion from the forum post is to run this whole script externally, which is a great suggestion but not as easily done as it sounds, as running an external script rather than pasting one in can cause a lot of back-and-forth traffic.

I wrote a Bash script that learns all the drives in one command and then gets all the detailed views in a single command as a second step. So I pull all that I need about each drive with only two SSH commands (rather than one per drive). Sorry folks, this is Unix or Mac OS only (unless you’re running some Unix tools on your Windows machine).

This script presumes you have the SSH key already set up for your userid (since I don’t specify a key, but you could add it to the script).   There are a large number of blank lines simply to make each section clear.

Simply paste it into a file like this

vi   <then hit ‘i’ and paste in the data, then shift ZZ to save and exit >
chmod 755
./ -u superuser -h   < where super user is your user and is your V7000 >

The script uses getopts to read two inputs and checks that both were supplied.   It has no error checking for an unreachable host.   If the user cannot log in with your default SSH key it will fail.

#!/bin/bash
# Script to display SVC or Storwize firmware versions

# Read the -u (user) and -h (host) options
while getopts :u:h: opt
do
 case "$opt" in
 u) username="$OPTARG";;
 h) hostname="$OPTARG";;
 esac
done

# We need a user name
if [ -z "$username" ]
then
echo "Please use a username with -u"
echo "For instance -u superuser"
exit 1
fi

# we need a host
if [ -z "$hostname" ]
then
echo "Please use a host with -h"
echo "For instance -h"
exit 1
fi

# Fetch and print the system software version
echo "Host $hostname is running Code Level: $(ssh $username@$hostname "svcinfo lssystem -delim ," | grep code_level | cut -d, -f2)"

# print the header for the drive data
printf "%5s%-15s%-20s%-15s \n" "ID" " Capacity" "DriveType" "Version"

# Fetch the drive summary view to get a list of drives
summarydrives=$(ssh $username@$hostname "svcinfo lsdrive -nohdr -delim ,")

# build the drive detailed view as one command before sending it
fetchdetailed=$(echo "$summarydrives" | while IFS="," read -ra drivedata
do
echo -n "svcinfo lsdrive -delim , ${drivedata[0]};"
done)

# now grab all the detailed drive data in one command
detailedview=$(ssh $username@$hostname "$fetchdetailed")

# now chunk through the detailed view output and print in table view
echo "$detailedview" | while IFS="," read desc data
do
[[ $desc == "id" ]] && printf "%5s" "$data"
[[ $desc == "capacity" ]] && printf "%-15s" " $data"
[[ $desc == "product_id" ]] && printf "%-20s" "$data"
[[ $desc == "firmware_level" ]] && printf "%-15s \n" "$data"
done

Hopefully this is useful to someone out there.  Suggestions always welcome!



IBM Releases several Data Integrity Alerts for Storwize products

IBM recently released three significant alerts for Storwize products (V3500, V3700, V5000 and V7000).

I am reproducing the text from the emails I received.   I tell you this because if IBM update the Website text, my blog post may not get updated.

1691 Error on Arrays When Using Multiple FlashCopies of The Same Source

ABSTRACT: There is an issue in the RAID software that calculates parity for systems that have multiple FlashCopies of the same source. This issue will cause the parity to be calculated incorrectly and may lead to the system logging a 1691 error and may eventually lead to an undetected data loss.

Affects: Storwize devices on 7.3 and 7.4 versions
Resolution: This issue is resolved in and

Note that 7.5.0. is not the latest version – do not install that version!
At time of writing is available. If you are on 7.3 or 7.4 then stick with

Note also that the IBM link above says that the issue affects only V7000s, but this is because there are separate alerts and pages for each Storwize model.
If you are using Storwize products of any kind with FlashCopy you are affected.  If you are not using FlashCopy, read on!

Data Integrity Issue when Using Encrypted Arrays

ABSTRACT: IBM has identified an issue which can cause data to be written to the wrong location on the drive when using encrypted arrays on Storwize V7000 Gen2 systems. This will often result in systems logging 1691 and 1322 errors, and undetected data loss.
Affects: V7000s on 7.4 and 7.5 versions
Resolution: This issue is resolved by APAR HU00820 in releases and

This really does affect only V7000s as other models don’t offer this software encryption feature.   If you are not using encryption, read on!

Data Integrity Issue when Drive Detects Unreadable Data

ABSTRACT: IBM has identified specific hard disk drive models supported by the Storwize family of products that may be exposed to possible undetected data corruption during a specific drive error recovery sequence. The corrupted data can eventually trigger the system to log a 1691 error. A firmware update that remediates against future occurrences of this issue is now available. IBM recommends that all customers with the affected drives apply these latest levels of code.

Note also that the IBM link above says that the issue affects only V7000s, but this is because there are separate alerts and pages for each Storwize model.
If you are using Storwize products of any kind with the listed Seagate disks then you are affected.

Now the website lists capacities…. but again you might be fooled.
The capacities shown here are decimal, but the Storwize GUI and CLI always adhere to binary honesty (which I like).  So don’t be fooled if the GUI tells you that you have 3.6 TB drives and they are not listed in the table below. They are 4 TB drives according to the label.

Product_id   Capacity   Minimum Firmware level containing fix 
ST300MM0006    300 GB   B56S
ST600MM0006    600 GB   B56S
ST900MM0006    900 GB   B56S
ST1200MM0007   1.2 TB   B57D
ST2000NM0023     2 TB   BC5G
ST3000NM0023     3 TB   BC5G
ST4000NM0023     4 TB   BC5G
ST6000NM0014     6 TB   BC75
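If you want to map the decimal capacities in the table to what the GUI and CLI will display, the arithmetic is just decimal terabytes (10^12 bytes) divided by binary terabytes (1024^4 bytes). A minimal host-side sketch, with a made-up function name and the result rounded to one decimal place:

```shell
# Convert a label capacity in decimal TB to binary TiB, one decimal place.
tb_to_tib() {
  awk -v tb="$1" 'BEGIN { printf "%.1f\n", tb * 1e12 / (1024 ^ 4) }'
}

tb_to_tib 4   # prints 3.6
tb_to_tib 2   # prints 1.8
```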

Also in the GUI, I found the firmware version of my drives was not shown by default, I had to add it as per the screen capture below.   Here is a quiz question…  does the screen capture show a potentially affected machine?


If you answered YES you would be correct!

To be sure we can run the software upgrade tool, or dump the script below into a CLI window (paste the whole thing!):

svcinfo lsdrive -nohdr -delim , | while IFS="," read -ra drives; do svcinfo lsdrive -delim , ${drives[0]} | { while IFS="," read desc data ; do [[ $desc == "id" ]] && id=$data; [[ $desc == "product_id" ]] && product_id=$data; [[ $desc == "firmware_level" ]] && firmware_level=$data; done; printf "%5s%10s%10s \n" "$id " "$product_id" "$firmware_level"; }; done

The output will look like this (I showed the paste so you see what your entire PuTTY session would look like).    Again, is this an affected machine?


Yes it is affected, as BC5C is below BC5G (G being later than C in the alphabet!).
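The same alphabetical comparison can be scripted host side. The sketch below hard-codes the minimum fixed levels from the table above and assumes, as I did by eye, that firmware levels for a given drive model sort as plain strings. The function names are hypothetical, and this is an illustration rather than a substitute for the IBM alert or the software upgrade tool.

```shell
# Minimum firmware level containing the fix, per product_id (from the table).
min_fixed_level() {
  case "$1" in
    ST300MM0006|ST600MM0006|ST900MM0006) echo B56S ;;
    ST1200MM0007) echo B57D ;;
    ST2000NM0023|ST3000NM0023|ST4000NM0023) echo BC5G ;;
    ST6000NM0014) echo BC75 ;;
    *) return 1 ;;   # drive model is not in the table
  esac
}

# Returns 0 if the drive needs the update, 1 if not, 2 if the model is unlisted.
needs_update() {
  min=$(min_fixed_level "$1") || return 2
  [ "$2" \< "$min" ]   # plain string comparison, assumed to match release order
}

needs_update ST4000NM0023 BC5C && echo "update needed"   # prints update needed
```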

Once you know you are affected, you can follow the upgrade instructions in the IBM Alert. It is much easier to do this on 7.4 as you can upgrade your drives from the GUI instead of using the CLI.





Actifio worked around the VMware CBT bug in 2012

There has been a lot of discussion lately about a VMware Change Block Tracking (CBT) bug that causes backup software to miss out on modified parts of VMDK files.  This results in corrupted backups.

The Register has had two articles about it:

Oct 27:  ESXi is telling fibs to backup software • The Register

Nov 3: VMware: Yep, ESXi bug plays ‘finders keepers’ with data backups • The Register

The articles point to this VMware Knowledgebase link, and mention that there is no fix available from VMware.  Ouch!

Well of course since Actifio uses VMware Change Block Tracking (CBT) to capture images of VMs, my first thought was sh…    ahh actually this is a child-friendly blog…   but you get the idea.   Were petabytes of client data at risk of being bad?

Fortunately the answer is a resounding no.  Actifio does not depend on this particular API because we saw the potential for a flaw like this a long time ago.  In fact we changed the way we use the VMware APIs in 2012 to ensure this API could not affect us.

Actually we were that concerned about data integrity when using external APIs that we developed a feature we call Fingerprinting to ensure the integrity of our images. With every image that Actifio creates, Actifio uses a sampling technique to confirm that the image created in Actifio’s storage pools is the same as the source we were fetching data from.  This applies to both VMs and to images created by our Connector software.

So with this Actifio customers can be assured that all available virtual images are free of any corrupt data due to CBT, backup calls, or any other capture procedures.


Monitoring IBM Storwize and IBM SVC products with Splunk

I have been playing around with Splunk recently, so I can understand what it is and why my customers may choose to use it.   For those that don’t know, Splunk (the product) captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations.  In essence Splunk is a really cool and smart way to look at and analyse your data.

Because Splunk is able to ingest data from almost any source we can quite easily start pulling data out of an IBM Storwize or SVC product and then investigate with Splunk.  I couldn’t find anything in Google on this subject, so here is a post that will help you along.

A common way to get data into Splunk is to use syslog.   Since Storwize can send events to syslog, all we need to do on the Storwize side is configure where the Splunk server is.

In this example I have chosen syslog level 7 (which is detailed output) and to send all events.


Then on the Splunk side, ensure Splunk is listening for syslog events.   Storwize always uses UDP port 514:


However this really only captures events.   There are lots of other pieces of information we may want to pull out of our Storwize products and graph in Splunk.   So let’s teach Splunk how to get them using CLI over SSH.

Firstly we need to supply Splunk a user ID so it can login to our Storwize and grab data.   I created a new user on my Storwize V3700 called Splunk, placed it in the Monitor group (so anyone with the Splunk userid and password can look but not touch) and then supplied a public SSH key since I don’t want to store a password in any text file and using SSH keys makes things nice and easy.  In this case I am using the file for the root user of my Splunk server, since in my case Splunk is running all scripts as root.


Now from my root command prompt on the Splunk server  (called av-linux) I test that access works to my V3700 (on IP address using the lsmdiskgrp command.   It’s all looking good.

[root@av-linux ~]# ssh splunk@ "lsmdiskgrp -delim ,"

So I am now set up to write scripts that Splunk can fire on a regular basis to pull data from my Storwize device using SSH CLI commands.

Now here are two important things to realize about using SSH commands to pull data from Storwize and ingest them into Splunk:

  1. For historical data like logs, it is very easy to pull the same data twice.  For instance if I grab the contents of the lseventlog command using an SSH script then I will get every event in the log, which is fine.   But if I grab it again the next day, most of the same events will be ingested again.   If I am looking to validate how often a particular event occurs, I will count the same event many times because I ingested it many times.   Ideally the Storwize CLI commands would let me filter on dates, but that functionality is not available.
  2. Real time display commands don’t insert a date into the output, but Splunk will log the date and time that each piece of data was collected on.
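One way around the duplicate problem in point 1 is to filter on the Splunk server before the data is ingested. The sketch below is an illustration under stated assumptions, not something I have run against a live system: it assumes the first column of comma-delimited lseventlog output is a sequence number that only ever increases, and it remembers the highest value seen in a state file. The function name and the state file path are hypothetical.

```shell
# Print only CSV rows whose first field is greater than the highest value
# seen on previous runs, then record the new high-water mark.
# Usage (host left for you to fill in):
#   ssh splunk@<your-storwize> "lseventlog -delim , -nohdr" | filter_new_events /var/tmp/v3700.seq
filter_new_events() {
  state="$1"
  last=0
  [ -f "$state" ] && last=$(cat "$state")
  awk -F, -v last="$last" -v state="$state" '
    $1 + 0 > last + 0 { print; if ($1 + 0 > max) max = $1 + 0 }
    END { if (max > last + 0) printf "%d\n", max > state }
  '
}
```

Because a header row does not start with a number, it is skipped by the same comparison, so the filter behaves the same with or without -nohdr.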

Let’s take the output of lsmdiskgrp as shown above.   If we run this once per day we could track the space consumption of each pool over time.   Sounds good, right?   So on my Splunk server I create a script like this.  Notice I get the output in bytes; this is important as the default output could be in MB or GB or TB.

ssh splunk@ "lsmdiskgrp -delim , -bytes"

I put the script into the /opt/splunk/bin/scripts folder and call it v37001pools.

I make it executable and give it a test run:

[root@av-linux scripts]# pwd
[root@av-linux scripts]# chmod 755 v37001pools
[root@av-linux scripts]# ./v37001pools

So now I tell Splunk I have a new input using a script:


Input the location of the script, the interval and the fact that this is CSV (because we are using -delim with a comma).  Note my interval is crazy:   every 60 seconds is way too often, and even every 3600 seconds is probably too often.  I used it to get lots of samples quickly.


I now confirm I have new data I can search:


And the data itself is time stamped with all fields identified and has all the data like pool names.

Now I can start graphing this data.   With Splunk what I find is that if someone publishes the XML this makes life way easier.    So I created an empty Dashboard called Storwize Pools and then immediately selected Edit Source.


Now replace the default source (delete any text already in the source) with this where you change the heading and script name with your own (in red) and the pool name of one of your pools (in blue).  If you have more than one pool, add an additional chart for every pool (copy all the chart section and just make a new chart).

In the attached Word document you will find the required XML.   For some reason WordPress kept fighting me and changing my quotes so I have attached the XML as a doc.


And we get a lovely Dashboard that looks like this.  Because the script runs every 60 seconds, I am getting 60 second stats.


We could run it every day or use a cron job to run it at the same time of every day (which makes more sense).   Maybe once per day at 1am by setting the interval to a cron value like this:   0 01 * * *


So hopefully that will help you get started with monitoring your SVC or Storwize product with Splunk.

If you would like some more examples, just leave a comment!
