Accessing the Instrumentation

Here are some rather bad photos of my 1972 Holden HQ Kingswood Premier, one of my first ever cars (and one that I sadly no longer own):

[Photos: IMG_0001, IMG_0009]

This was the V8 Four Litre model (actually 253 cubic inches, often jokingly described as having all the power of a 6 cylinder with the fuel economy of an 8).   The engine bay was so huge and empty I could open the bonnet and sit on the side of the car with my feet comfortably inside the bay while I changed spark plugs or cleaned the points.

The Kingswood was not what you would call an instrumented vehicle.   The dashboard had a speedo, a fuel gauge and three lights:   Temperature, Oil and Charging.   I dubbed these three the idiot lights, because once they came on, you were the idiot.   (Sorry, no picture; this was the 1980s.)

Modern storage infrastructure by comparison is slightly more instrumented.   A vast array of metrics are tracked and these can be used to perform all sorts of analysis.   Analysis like:

  • Are my hosts getting good response times?
  • Are specific disks or arrays being overworked?
  • Are my fibre ports being used in a balanced fashion?

So can you do this with the Storwize products?    Of course!   I documented the built-in tools here (where I talked about the Performance GUI):

And here (where I talked about the performance CLI):

But these tools have only limited usefulness.   They are not granular, in that you cannot look at specific hosts or specific arrays or specific FC ports (meaning the three analysis ideas I suggested above are not even possible).   So how can we do this analysis?

The good news is that Storwize products do track all the metrics needed for very granular analysis, and these are freely accessible as statistics files.   The files are documented by IBM; here is a fairly old page that documents some of them:

But how to turn these into something useful?   There used to be a tool called svcmon but this tool appears to have been killed as per this rather sad blog post:

There is another IBM Community developed tool called qperf which you can access using the link below:

With a graphing tool here:

And another tool here:

And yet another one here!

The challenge for many of these tools is that they require manual setup, usually have a limited database engine and analysis is not always easy or simple.

You can of course use IBM’s TPC:

You could also consider IntelliMagic.   Although I have not looked too deeply at this one, these guys wrote IBM’s Disk Magic tool, so they certainly understand storage performance.

The challenge for all Storage Admins is that they are not always experts at diagnosing performance issues.   Seeing genuine examples of the thinking process, and of the flow from problem to solution, is vital.   This makes BVQ another good choice.

To see an example of how instrumented data presented in a graphical format can be used to generate a useful problem analysis, check out this blog post here:

and another one here:

I really like these posts for two reasons:

  1. They clearly show just how instrumented the product is
  2. They clearly show how using this data in a graphical format can lead to good and quick root cause analysis.

Also have a look at some of these videos:

So how are you instrumenting your Storwize?
What do you find the easiest tool to use?





DevOps Culture lessons for all of us?

A colleague of mine recently pointed me at this fascinating document:

It seems everyone is talking about DevOps right now, and if we accept even half of what this report tells us, to ignore what it is saying would be an opportunity lost.   In essence the report suggests:

High-performing IT organizations experience 60 times fewer failures and recover from failure 168 times faster than their lower-performing peers. They also deploy 30 times more frequently with 200 times shorter lead times.

Can we believe these numbers?   There is a risk that measured results don’t totally equate to the benefits of DevOps adoption.   After all, many of the early adopters in this space were in fast growing or new-market sectors.   They were probably going to grow rapidly regardless, simply because they were already innovating in new areas.    However what they do clearly show is that if you can achieve high rates of change, but with lower levels of risk, you can adapt faster to market needs and customer demands.   And that is what would provide you with competitive advantage.

I struck a simple example of speed to market literally days ago: after upgrading my iPhone to iOS 9, I found my bank’s iPhone app kept crashing.   They had missed the upgrade boat, so to speak, and took a week to catch up.   In some ways I should be pleased it was only a week, but in today’s economy, seven days is a lifetime.

However the part of the report that really struck a chord for me was the section titled Why Culture Matters.   EVERYONE in every company should read this section.   Print out these tables.  Tape them to the desk or wall.   Bring them to meetings.   Reflect loudly….  what kind of manager do you have?   What kind of culture is your management team engendering?


More importantly are these strategies being followed?


On the same day, I read/listened to this:

The fundamental message being that providing employees with places to gather and chat informally can generate huge benefits.   Can this even occur in companies who don’t even provide their workers with tea and coffee?

I finished my day reading the DevOpsGuys blog.   I loved this discussion of technical debt (and the blog from Box referenced in the comments).   How many organizations out there are burdened by technical debt?

Don’t know what I mean?   Read the blog:



Do not install ESXi 5.5 Update 3 if you rely on VMware snapshots

ESXi 5.5 Update 3 was released on September 16, 2015.   Since it was released it has emerged that after upgrading an ESXi host to this update, a snapshot consolidation task can result in the relevant Virtual Machine suffering an outage.

This disruptive issue occurs due to a segmentation fault when changing the snapshot tree data-structure.

More details are here:

Snapshot consolidation causes virtual machines running on VMware ESXi 5.5 Update 3 hosts to fail with the error: Unexpected signal: 11 (2133118)

Clearly you should not install this update if your data protection software relies on VMware snapshots.   If you have already installed it, consult the VMware link above for a workaround, or suspend your snapshot scheduler (which you may need to do from your data protection software) while we wait for a fix from VMware.



Making sense of the IBM SVC/Storwize Code release cycle

UPDATE 14 August 2015
When I initially posted this blog, there was a major error in my base spreadsheet, that made the time periods shorter than they actually were (because it was only using weekdays, 5 days per week, not 7 days per week, which was my mistake).
I withdrew the post and corrected it and this is an edited update.   The conclusions remain pretty well the same, but the time periods are larger than initially stated.

A common question I get is this:

A new version of code has just been released for my SVC or Storwize product, should I upgrade or should I wait?

The challenge for many customers is that these upgrades:

  • Need change windows
  • Cannot be backed out
  • Rely on redundancy to avoid downtime
  • Take over an hour to complete

So when is the right point to get the most reliability and access to new features and hardware support, but with the least number of change windows?  It occurred to me only recently that rather than rely on a bunch of potentially subjective or gut-feel decision points, I could just use maths here.   Go all Freakonomics on this subject.

Now it turns out IBM actually made this fairly easy as they publish the build dates for the entire release history right here:

They are usually in a format like 115.51.1507081154000  where the 1507081154000 can be read as  11:54 on the 8th of July 2015.      Now I don’t work for IBM engineering, but my take is that if they publish a build date then I am fairly confident that this will be when the code was built from source.   Normally it is then fed to QA and if it passes their sometimes real world tests (I am only being slightly sarcastic), it hits the field release process.   So the build date is not when it was released, but when it was built.
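As a minimal sketch of that reading, the shell function below takes the last dot-separated field of a build string and rewrites its first ten digits as a date and time. The name decode_build is made up for illustration, and the function assumes that every build belongs to the 2000s and that the digits are laid out as YYMMDDHHMM, which is my interpretation rather than anything IBM documents here:

```shell
# Hypothetical helper: turn a build string into "YYYY-MM-DD HH:MM".
# Assumes the last dot-separated field starts with YYMMDDHHMM and that
# the century is 20xx; any trailing digits are ignored.
decode_build() {
    printf '%s\n' "${1##*.}" |
        sed 's/^\(..\)\(..\)\(..\)\(..\)\(..\).*/20\1-\2-\3 \4:\5/'
}

decode_build 115.51.1507081154000   # prints 2015-07-08 11:54
```

Run it on any Unix-like workstation; nothing in it needs access to the SVC or Storwize system itself.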

So I took all of these dates and put them into a spreadsheet and then calculated time periods between builds (which I will call release dates, knowing they were NOT the actual date of release).

I considered three periods:

  • The time period between major releases  (i.e. days between 7.3 and 7.4).   I made an executive decision to treat releases 4.1.0 and 4.1.1 as major even though they are probably not.   You will see why as we go through.   This metric shows how often major releases come out.
  • Days between updates within a release (for instance, how many days pass between one update and the next, compared to how many days pass between that update and the one after it).   This metric shows how often patches come out.
  • Days between the build date of each update and the build date of its major release (for instance, days between a major release and its first update versus days between that major release and its second update).   This metric shows the patch release lifecycle of each release.
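The spreadsheet did the date arithmetic for me, but the same day counts can be sketched in shell. This is only an illustration under stated assumptions: the helper names build_to_days and days_between are invented, the build strings are assumed to end in a field that starts with YYMMDD in the 2000s, and the two example builds are simply ones quoted on this blog, not a release and its update:

```shell
# Hypothetical helpers for the day counts described above.
# Assumes build strings end in a field that starts with YYMMDD (20xx).

# Days from 1970-01-01 to the build date, using plain integer arithmetic
# on the Gregorian calendar so no particular date command is needed.
build_to_days() (
    set -- $(printf '%s\n' "${1##*.}" |
        sed 's/^\(..\)\(..\)\(..\).*/\1 \2 \3/')
    y=$(( 2000 + ${1#0} )); m=${2#0}; d=${3#0}
    # Count January and February as months 11 and 12 of the previous year
    if [ "$m" -le 2 ]; then y=$(( y - 1 )); fi
    era=$(( y / 400 ))
    yoe=$(( y - era * 400 ))
    mp=$(( (m + 9) % 12 ))                      # March = 0 ... February = 11
    doy=$(( (153 * mp + 2) / 5 + d - 1 ))
    doe=$(( yoe * 365 + yoe / 4 - yoe / 100 + doy ))
    echo $(( era * 146097 + doe - 719468 ))
)

# Days from the first build to the second build (all 7 days of the week).
days_between() {
    echo $(( $(build_to_days "$2") - $(build_to_days "$1") ))
}

days_between 97.5.1501190000 115.51.1507081154000   # prints 170
```

Counting calendar days directly like this also sidesteps the weekday-only mistake I made in the first version of the spreadsheet.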

I then graphed each metric to get a visual impression of these.   Let’s check them out…

Time period between major releases

This one is interesting as the trend is clear: a new major release is coming out roughly every 180 days.   This shows that a six-monthly release cycle is definitely being worked to.


However there is a large glitch in the center, which makes you wonder whether there were some delays in certain releases.    We know that Release 5.0 had 64 bit changes and Release 6.0 brought the Storwize platform into play, so that helps explain that spike.


Note that the very first release doesn’t appear, as there was no release before it!

Days between updates within a release

This one is quite interesting.   If you go to a release, how many days will pass before another build hits the field?   In other words, I just upgraded… how many days will pass before I may potentially have to upgrade again (presuming I am determined to always run the latest release).

The short answer, based on the trend line Excel added, is that the interval is becoming shorter as time goes by.  It started at 70 days and is now closer to 45 days.  However, that’s using a linear trendline; a logarithmic trendline is much flatter and stays almost constantly at 50 days.


Days between updates from the build of that major release

So this is the most interesting graph of all.    Once a major release comes out, a series of patch releases come out for that release.  How quickly do they get released?   If the first five updates come out in the first 60 days and then there is a 30 day gap, does that mean I should wait 90 days?


This endless series of hills shows the way release histories run.   At the start of each cycle there are lots of releases, each one coming out after a slightly longer period than the previous one.   There is usually a very late update, probably a roll-up for the slow upgraders (thus the sudden spike to the top of each hill).   It also shows the effective lifespan of a code release is around 500-550 days as new updates simply don’t come out after a certain point.   The more recent code levels taper off on those peaks, as they are not that old yet.

This leaves the one killer question… how long should I wait?    I looked at just release 7.1 to 7.5 to see if I could work out just from the release cycle, when to jump.   The red line shows the days between each release while the blue line shows the cumulative days within the release.   From a user perspective, the higher the red line gets the better, as this means less code churn.   So I look for the first major red peak and then find the matching point on the blue graph to see when that occurred.


I was looking for when the red line gets over 60 days and stays there, which typically occurs between 120 and 250 days after a release.   Waiting 5 months is a very long interval, I think; it really depends on how conservative you want to be.   I do note that release 7.4 has the best correlation between the red and blue graphs, which is a very good sign.
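The "over 60 days and stays there" test can be sketched as a small shell function. This is one possible reading of the rule, not the exact method behind the graphs: the name settle_day is invented, the inputs are the cumulative days (the blue line) for each build in a release, and the numbers in the example are made up purely to show the mechanics:

```shell
# Hypothetical sketch: given a threshold and the cumulative day of each
# build in a release, print the cumulative day of the build after which
# every remaining gap is at least the threshold ("none" if that never
# happens).
settle_day() (
    threshold=$1; shift
    prev=$1; shift
    settle=none
    for day in "$@"; do
        if [ $(( day - prev )) -ge "$threshold" ]; then
            # Start of a run of long gaps: remember the build before it
            [ "$settle" = none ] && settle=$prev
        else
            # A short gap means the churn has not settled yet
            settle=none
        fi
        prev=$day
    done
    echo "$settle"
)

# Invented example: builds on days 0, 20, 45, 80, 150 and 230
settle_day 60 0 20 45 80 150 230   # prints 80
```

A more conservative admin could raise the threshold; a less conservative one could accept the first long gap without requiring it to hold.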

So what can we learn from all of this?

I certainly learnt several things:

  • Major release cycles are six monthly
  • Patch updates taper off as the release ages and eventually stop after around 18 months
  • Patch updates are coming out  on average every 50 days
  • It takes between 150 and 250 days before new patch release intervals start to slow down within each build

What I need to do next is look at the incidence of hyper releases (ones that have severe impact) to see if we can add any extra metrics to this study.

For reference, attached is the spreadsheet I built with the IBM release dates.










Unix tools for the SVC and Storwize masses: a 12 year journey

So it turns out the only thing harder than writing blog posts is keeping up with other people’s blog posts  (actually it’s easy to write blog posts….  it’s just hard to write good and useful ones).   So when I ranted last week about the lack of Unix tools in the SVC/Storwize rbash shell, it turns out I had missed the memo…  they have finally arrived!    Barry Whyte revealed all here:

It has been an interesting journey to watch the SVC code base mature and expand its horizons as it has gone from a straight storage virtualization platform to a storage controller platform as well.   In the process the effective user base for this platform has increased dramatically, I suspect well in excess of tenfold.    And with that increase comes much more pressure on usability.     In the CLI space, the first big thing to change was that IBM dropped the requirement to use private/public keys to log in to the CLI shell.   The fact of the matter is that many storage admins are not UNIX experts…  they don’t use PuTTY in their daily lives and running PuTTYgen to generate keys felt positively strange to them.   They just wanted to type a username and password and be logged in….  and that is precisely what they got.

The need to preface info and task type commands with svcinfo and svctask also went by the wayside, which made things simpler as well (although I usually leave them in sample scripts for the ultimate backwards compatibility).

But in the opposite direction was the restricted bash shell that the CLI user gets to live in. Unix users (particularly those power users who know how to bash out a complex awk command in 40 keystrokes), were all a little stunned that they had to run all the cool commands outside the SSH command.    Telling them it has been that way for 12 years didn’t make them feel any better.

So with the 7.5 release (which is still fairly steaming new), you get 11 Unix shell commands that are all very very sweet:


So for a simple example, if I want to grab just the VDisk IDs for all VDisks I would normally use a command like this one:   lsvdisk -delim , -nohdr | cut -d"," -f1

Run on a pre 7.5 machine I get:

IBM_Storwize:anthonyv>  lsvdisk -delim , -nohdr | cut -d"," -f1
rbash: grep: command not found

On a 7.5 machine I get:

IBM_Storwize:anthonyv>lsvdisk -delim , -nohdr | cut -d"," -f1

What’s nice is that a Windows admin firing commands with plink can now get unix commands run on the remote side with just a Windows plink command on the local side.

My only complaint is lack of floating point support.   Bash by default does not handle floating point.   To prove this, here is an amusing example where we get bash to perform some division:

IBM_Storwize:anthonyv>echo "4 divided by 2 equals $(( 4/2 ))"
4 divided by 2 equals 2

IBM_Storwize:anthonyv>echo "2 divided by 4 equals $(( 2/4 ))"
2 divided by 4 equals 0

I normally use awk to handle floating point calculations.  Say you sum the size in bytes of a number of MDisks and then divide by 1024 three times to get GiB; you may want the result to, say, three decimal places, but with bash built-ins and the tools so far exposed to the rbash user, you cannot do this.  The bc command can also handle floating point, so next time you see your IBM rep, say ‘nice try but you missed floating point’.
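For comparison, here is roughly what the awk version looks like when run on a workstation that has awk, rather than inside the restricted shell. It is a sketch only: the function name sum_gib is invented, the two byte counts are made-up inputs, and it assumes one size in bytes per input line:

```shell
# Hypothetical helper: sum byte counts (one per line on stdin) and
# print the total in GiB to three decimal places.
sum_gib() {
    awk '{ total += $1 } END { printf "%.3f\n", total / (1024 * 1024 * 1024) }'
}

# Two invented sizes: 1 GiB and 0.5 GiB
printf '%s\n' 1073741824 536870912 | sum_gib   # prints 1.500
```

The same pipeline would end in integer-only arithmetic if it had to be done with bash built-ins alone, which is exactly the gap described above.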

Let me know what they say.    #;-)


Some scripting hints for checking your Storwize firmware version

I am in the habit of writing mini-shell scripts to paste into an SVC or Storwize terminal to create mini reports.  While the GUI does make all this quite easy, I am quite often remote from the machine, or at the end of a VPN tunnel, so using the GUI is not always convenient.

So for instance if you want to learn just the software version of your Storwize device and the firmware version of its drives, there is no super quick way to do that from the command line.  You can use the lssystem command to get the cluster software version, but since there is no grep command on Storwize/SVC, you need to sort through all the output yourself host side or use some fancy bash tricks.   You can use the lsdrive command to get the drive firmware version, but the summary view of lsdrive does not show the drive product ID or firmware version.   This is rather annoying as it means you need to run lsdrive against every drive to get that level of detail.   In a perfect world I should be able to specify which fields I want in the summary view (see an example below of the rather sparse summary view):


I wrote a small script to display the firmware version of each drive.   It looks like this:

# Grab the code_level line from lssystem
firmwareversion=$(svcinfo lssystem -delim , | while IFS="," read -ra data
do
if [ "${data[0]}" == "code_level" ]
then echo ${data[1]}
fi
done)
# Build the table: one detailed lsdrive call per drive in the summary view
drive=$(printf "%5s%-20s%-10s%-15s \n" "ID" " DriveType" "Capacity" "Version"
svcinfo lsdrive -nohdr -delim , | while IFS="," read -ra drives
do
svcinfo lsdrive -delim , ${drives[0]} | { while IFS="," read desc data
do
[[ $desc == "id" ]] && id=$data
[[ $desc == "product_id" ]] && product_id=$data
[[ $desc == "capacity" ]] && capacity=$data
[[ $desc == "firmware_level" ]] && firmware_level=$data
done
printf "%5s%-20s%-10s%-15s \n" "$id" " $product_id" "$capacity" "$firmware_level"
}
done);echo "";echo "Version $firmwareversion";echo "";echo "$drive"

Now for those who understand shell scripts, it uses the -delim , option to separate fields (since in general, commas are not allowed to appear in any data fields).  It then reads the output into an array with the command read -ra, telling the read command that the field delimiter is a comma with the IFS="," statement.
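To see that splitting in isolation, here is a small sketch you can run on any workstation shell. It uses named variables instead of the array that read -ra builds, but the comma splitting via IFS is the same idea. The function name and the variable names are my own labels (an assumption, not CLI output), and the sample record is a shortened copy of a summary line quoted later in this post:

```shell
# Split one comma-delimited record, as produced by -delim , on the CLI.
# IFS is set only for the read itself, so the rest of the shell is
# unaffected. The variable names are illustrative labels for the columns.
drive_id_and_type() {
    printf '%s\n' "$1" | {
        IFS="," read -r id status errseq use tech rest
        echo "$id $tech"
    }
}

drive_id_and_type "0,online,,member,sas_nearline_hdd,2.7TB"
# prints: 0 sas_nearline_hdd
```

Note that the empty third field is preserved, which is why a non-whitespace delimiter such as a comma is safer than splitting on spaces.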

When I ran the script on a Storwize V3700 running 7.3 code, it ran fine with output like this:

Version (build 97.5.1501190000)

ID  DriveType       Capacity  Version 
 0  HUS723020ALS64  1.8TB     J3K8 
 1  HUS723020ALS64  1.8TB     J3K8 
 2  HUS723020ALS64  1.8TB     J3K8 
 3  HUS723020ALS64  1.8TB     J3K8 
 4  HUS723020ALS64  1.8TB     J3K8 
 5  HUS723020ALS64  1.8TB     J3K8

  But when I ran it on a V3700 running 7.4 or 7.5 code I got this:

rbash: IFS: readonly variable
rbash: IFS: readonly variable
CMMVC5709E [0,online,,member,sas_nearline_hdd,2.7TB,0,mdisk1,5,1,8,,,inactive] is not a supported parameter.

I googled the issue and found this rather helpful forum discussion:

The solution is two-fold:

  1. Don’t use IFS to define field separators.
  2. Don’t allow any names in your system to have spaces in them.  This normally doesn’t occur but it appears they may be allowing them in VDisk names.
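Another way to avoid IFS altogether, while still using comma-delimited output, is to peel fields off with parameter expansion. The sketch below is an assumption-heavy illustration: the function name nth_field is invented, and it has only been reasoned through for an ordinary shell, not verified inside the restricted shell on a 7.4 or 7.5 system:

```shell
# Hypothetical helper: print field N (counting from 1) of a
# comma-delimited line without touching IFS.
# If N is larger than the number of fields, the last field is printed.
nth_field() (
    line=$1; n=$2
    while [ "$n" -gt 1 ]; do
        line=${line#*,}        # drop everything up to the first comma
        n=$(( n - 1 ))
    done
    printf '%s\n' "${line%%,*}"   # keep everything before the next comma
)

nth_field "0,online,,member,sas_nearline_hdd,2.7TB" 5
# prints: sas_nearline_hdd
```

Because the delimiter is still a comma, this approach would not care whether a name contains spaces.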

So I re-wrote my script to not rely on IFS to separate fields and it now looks like this (and runs fine on release 7.4 and 7.5 machines):

# Grab the code_level line from lssystem (fields split on whitespace)
firmwareversion=$(svcinfo lssystem | while read desc data
do
if [ "$desc" == "code_level" ]
then echo $data
fi
done)
# Build the table: one detailed lsdrive call per drive in the summary view
drive=$(printf "%5s%-20s%-10s%-15s \n" "ID" " DriveType" "Capacity" "Version"
svcinfo lsdrive -nohdr | while read did status
do
svcinfo lsdrive $did | { while read desc data
do
[[ $desc == "id" ]] && id=$data
[[ $desc == "product_id" ]] && product_id=$data
[[ $desc == "firmware_level" ]] && firmware_level=$data
[[ $desc == "capacity" ]] && capacity=$data
done
printf "%5s%-20s%-10s%-15s \n" "$id" " $product_id" "$capacity" "$firmware_level"
}
done);echo "";echo "Version $firmwareversion";echo "";echo "$drive"

The other suggestion from the forum post is to run this whole script externally, which is a great suggestion but not as easily done as it sounds, as running an external script vs pasting in a script can cause a lot of back and forward traffic.

I wrote a BASH script that learns all the drives in one command and then gets all the detailed views in a single command as a second step.    So I pull all that I need about each drive with only two SSH commands (rather than 1 per drive).  Sorry folks, this is Unix or Mac OS only (unless you’re running some unix tools on your Windows machine).

This script presumes you have the SSH key already set up for your userid (since I don’t specify a key, but you could add it to the script).   There are a large number of blank lines simply to make each section clear.

Simply paste it into a file like this

vi   <then hit ‘i’ and paste in the data, then shift ZZ to save and exit >
chmod 755
./ -u superuser -h   < where super user is your user and is your V7000 >

The script uses getopts to get two inputs and check for them.   It has no error checking for an unreachable host.   If the user cannot log in with your default SSH key it will fail.

#!/bin/bash
# Script to display SVC or Storwize firmware versions

# Parse -u <username> and -h <hostname>
while getopts :u:h: opt
do
 case "$opt" in
 u) username="$OPTARG";;
 h) hostname="$OPTARG";;
 esac
done

# We need a user name
if [ -z "$username" ]
then
echo "Please use a username with -u"
echo "For instance -u superuser"
exit 1
fi

# we need a host
if [ -z "$hostname" ]
then
echo "Please use a host with -h"
echo "For instance -h"
exit 1
fi

# Fetch and print the system software version
echo "Host $hostname is running Code Level: $(ssh $username@$hostname "svcinfo lssystem -delim ," | grep code_level | cut -d, -f2)"

# print the header for the drive data
printf "%5s%-15s%-20s%-15s \n" "ID" " Capacity" "DriveType" "Version"

# Fetch the drive summary view to get a list of drives
summarydrives=$(ssh $username@$hostname "svcinfo lsdrive -nohdr -delim ,")

# build the drive detailed view as one command before sending it
fetchdetailed=$(echo "$summarydrives" | while IFS="," read -ra drivedata
do
echo -n "svcinfo lsdrive -delim , ${drivedata[0]};"
done)

# now grab all the detailed drive data in one command
detailedview=$(ssh $username@$hostname "$fetchdetailed")

# now chunk through the detailed view output and print in table view
echo "$detailedview" | while IFS="," read desc data
do
[[ $desc == "id" ]] && printf "%5s" "$data"
[[ $desc == "capacity" ]] && printf "%-15s" " $data"
[[ $desc == "product_id" ]] && printf "%-20s" "$data"
[[ $desc == "firmware_level" ]] && printf "%-15s \n" "$data"
done

Hopefully this is useful to someone out there.  Suggestions always welcome!



IBM Releases several Data Integrity Alerts for Storwize products

IBM recently released three major and significant alerts for Storwize products (V3500, V3700, V5000 and V7000).

I am reproducing the text from the emails I received.   I tell you this because if IBM update the Website text, my blog post may not get updated.

1691 Error on Arrays When Using Multiple FlashCopies of The Same Source

ABSTRACT: There is an issue in the RAID software that calculates parity for systems that have multiple FlashCopies of the same source. This issue will cause the parity to be calculated incorrectly and may lead to the system logging a 1691 error and may eventually lead to an undetected data loss.

Affects: Storwize devices on 7.3 and 7.4 versions
Resolution: This issue is resolved in and

Note that 7.5.0. is not the latest version – do not install that version!
At time of writing is available. If you are on 7.3 or 7.4 then stick with

Note also that the IBM link above says that the issue affects only V7000s, but this is because there are separate alerts and pages for each Storwize model.
If you are using Storwize products of any kind with FlashCopy you are affected.  If you are not using FlashCopy, read on!

Data Integrity Issue when Using Encrypted Arrays

ABSTRACT: IBM has identified an issue which can cause data to be written to the wrong location on the drive when using encrypted arrays on Storwize V7000 Gen2 systems. This will often result in systems logging 1691 and 1322 errors, and undetected data loss.
Affects: V7000s on 7.4 and 7.5 versions
Resolution: This issue is resolved by APAR HU00820 in releases and

This really does affect only V7000s, as other models don’t offer this software encryption feature.   If you are not using Encryption, read on!

Data Integrity Issue when Drive Detects Unreadable Data

ABSTRACT: IBM has identified specific hard disk drive models supported by the Storwize family of products that may be exposed to possible undetected data corruption during a specific drive error recovery sequence. The corrupted data can eventually trigger the system to log a 1691 error. A firmware update that remediates against future occurrences of this issue is now available. IBM recommends that all customers with the affected drives apply these latest levels of code.

Note also that the IBM link above says that the issue affects only V7000s, but this is because there are separate alerts and pages for each Storwize model.
If you are using Storwize products of any kind with the listed Seagate disks then you are affected.

Now the website lists capacities…. but again you might be fooled.
The capacities shown here are decimal, but the Storwize GUI and CLI always adhere to binary honesty (which I like).  So don’t be fooled if the GUI tells you that you have 3.6 TB drives and they are not listed in the table below…. They are 4 TB drives according to the label.

Product_id   Capacity   Minimum Firmware level containing fix 
ST300MM0006    300 GB   B56S
ST600MM0006    600 GB   B56S
ST900MM0006    900 GB   B56S
ST1200MM0007   1.2 TB   B57D
ST2000NM0023     2 TB   BC5G
ST3000NM0023     3 TB   BC5G
ST4000NM0023     4 TB   BC5G
ST6000NM0014     6 TB   BC75

Also in the GUI, I found the firmware version of my drives was not shown by default, so I had to add it as per the screen capture below.   Here is a quiz question…  does the screen capture show a potentially affected machine?


If you answered YES you would be correct!

To be sure we can run the software upgrade tool, or dump the script below into a CLI window (paste the whole thing!):

svcinfo lsdrive -nohdr -delim , | while IFS="," read -ra drives; do svcinfo lsdrive -delim , ${drives[0]} | { while IFS="," read desc data ; do [[ $desc == "id" ]] && id=$data; [[ $desc == "product_id" ]] && product_id=$data; [[ $desc == "firmware_level" ]] && firmware_level=$data; done; printf "%5s%10s%10s \n" "$id " "$product_id" "$firmware_level"; }; done

The output will look like this (I showed the paste so you see what your entire PuTTY session would look like).    Again, is this an affected machine?


Yes it is affected, as BC5C is below BC5G (G being later than C in the alphabet!).
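That alphabetical comparison can be sketched as a workstation-side check against the table above. Treat it as an illustration only: the function names are invented, pairing BC5C with the ST4000NM0023 in the example is hypothetical, and the big assumption is that firmware levels for one drive model always sort as plain strings, which is my reading rather than something the IBM alert states:

```shell
# Minimum fixed firmware level per product_id, copied from the table above.
min_fixed_level() {
    case "$1" in
        ST300MM0006|ST600MM0006|ST900MM0006) echo B56S ;;
        ST1200MM0007) echo B57D ;;
        ST2000NM0023|ST3000NM0023|ST4000NM0023) echo BC5G ;;
        ST6000NM0014) echo BC75 ;;
        *) return 1 ;;   # not one of the listed drive models
    esac
}

# Succeeds (exit status 0) when the installed level sorts strictly
# before the minimum fixed level. ASSUMPTION: plain string ordering.
needs_update() {
    min=$(min_fixed_level "$1") || return 1
    [ "$2" != "$min" ] &&
        [ "$(printf '%s\n' "$2" "$min" | LC_ALL=C sort | head -n 1)" = "$2" ]
}

needs_update ST4000NM0023 BC5C && echo "affected"   # prints affected
```

Drives whose product_id is not in the table are reported as not needing this particular update, so the check says nothing about other alerts.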

Once you know you are affected, you can follow the upgrade instructions in the IBM Alert. It is much easier to do this on 7.4 as you can upgrade your drives from the GUI instead of using the CLI.



