Updating your social will

I spotted a couple of things on the weekend that I thought were worth sharing. First up, I was looking for a friend of mine on LinkedIn, only to find him twice. His older profile was attached to his email address at a former employer. It appears he had forgotten his original LinkedIn password and could not reset it, since the reset email went to a corporate address he could no longer access. So he had abandoned that profile and started again. I found a thread about this here in case you are suffering from the same issue.

Later, by chance, I spotted something else that was much more thought-provoking. One of my work colleagues sadly passed away last year, but her profile is still on LinkedIn. Again, the email address attached to that profile is her corporate address. I am unsure whether her family know about her presence on LinkedIn, but it certainly got me thinking.

If your life in the real world involves representing yourself in the digital world (which is almost certain if you’re reading this blog post), then you should think about ensuring that when you are no longer with us, your loved ones have a way to bring closure to your digital profiles. Would they know what you wanted them to do? Would they be able to act even if they did know? I suspect not.

What I suggest you do is three things:

  1. Document what you want done with your digital profiles, ideally placing this in your will.   This will give your executor the power to act on your behalf.
  2. Document a way to access these profiles. This could mean recording your passwords somewhere, or using a common email address that someone can access if password resets are needed to get to your accounts. Either way, they need to be able to log on. Don’t rely on calling Facebook or LinkedIn to get help; I suspect that could prove difficult.
  3. Do not use your corporate email address for any of your digital profiles. If you suddenly leave an employer, or worse, suddenly pass away, the first thing your employer will do is shut down your email account. A family member would probably have no hope of accessing it anyway. So changing to a private email address is a good idea regardless. It will prevent issues resetting profile passwords and avoid an orphaned LinkedIn profile that you can no longer get to. If you do nothing else after reading this post, at least do this.

If all of this sounds morbid, it probably is.  But it is also critical that you do it.

For more details, check out this newspaper article which documents this quite well.

If you have any suggestions I would really like to hear them.

Posted in advice

Performance and benchmarking tools

So you need to do some disk performance testing?   Maybe some benchmarking?  What tools are out there to help you out?   Well I am glad you asked… here are some that I use on my daily travels:

IOmeter

IOmeter is an old classic, with emphasis on the word old. At the time of writing, the most recent update was from 2006. However, it remains very popular, mainly because it is free and easy to use.

Some tips when using IOmeter:

On Windows, IOmeter needs to be run as an Administrator. Not doing so seems to be the most common mistake people make (if you are not running as Administrator, you don’t see any drives). You can only run one instance of IOmeter in Windows, which means that if multiple users log on to the same server, only one of them can run IOmeter.

You also really need to run IOmeter with a queue depth (or number of outstanding I/Os) greater than one, with multiple workers. If you don’t, you will not be able to drive the storage to saturation. For instance, here are some results running 75% read I/O, 0% random, 4 KB blocks on a Windows 2008 machine with 4 workers. Each test ran against the same 128 GB volume on a Storwize V7000, backed by 4 x 300 GB SSDs in a RAID10 array. In each case I let the machine run for 10 minutes before taking the screen capture, to ensure the performance was steady state and not peaking.

Firstly I used a queue depth of one.   Aggregate performance was around 27000 IOPS.

Then I used a queue depth of 10.  Aggregate performance was around 81000 IOPS.

I then used a queue depth of 20.  Aggregate performance was around 113000 IOPS.

What I am trying to show is that taking the defaults (one worker with a queue depth of 1) will not drive the storage to a useful value for comparison… you need to do some tuning and some experimenting to get valid results.  At some point increasing queue depths will not improve performance (it may actually decrease it).
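To make the scaling easier to see, here is a minimal Python sketch that takes the three aggregate IOPS figures quoted above and works out the speed-up relative to a queue depth of one. It also estimates an implied average latency using Little's Law. That estimate assumes the queue depth applies per worker, so total outstanding I/Os = queue depth x 4 workers. That is my assumption for illustration, not something read from the IOmeter output, and the function name is my own.

```python
# Aggregate IOPS figures quoted in the post (4 workers, 75% read, 4 KB blocks).
WORKERS = 4
results = {1: 27000, 10: 81000, 20: 113000}  # queue depth -> aggregate IOPS


def summarise(results, workers):
    """Return (queue depth, IOPS, speed-up vs lowest depth, implied latency in ms)."""
    baseline = results[min(results)]
    rows = []
    for depth in sorted(results):
        iops = results[depth]
        # Assumption: the queue depth setting applies to each worker.
        outstanding = depth * workers
        # Little's Law: average latency = outstanding I/Os / throughput.
        latency_ms = outstanding / iops * 1000
        rows.append((depth, iops, round(iops / baseline, 2), round(latency_ms, 3)))
    return rows


for depth, iops, speedup, latency_ms in summarise(results, WORKERS):
    print(f"QD {depth:>2}: {iops:>7} IOPS, {speedup}x vs QD 1, ~{latency_ms} ms implied latency")
```

Going from a queue depth of 1 to 10 triples the throughput, but doubling it again to 20 adds only about 40% more. That is the flattening you would expect as the storage approaches saturation.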

You can get IOmeter from here:

http://www.iometer.org/

There is a great how-to document for IOmeter here:

http://greg.porter.name/wiki/HowTo:iometer

IOrate

There is an alternative to IOmeter called IOrate (created by an EMC employee). It is also very popular and appears to still be in active development. It is not unusual to see IBM performance whitepapers that use IOrate to generate the workload.

You can download it from here:

http://iorate.org/

I/O Analyzer

This is a fairly recent tool that I have not had a chance to try out (due to time pressures).   The tool uses virtual machines under VMware to generate the I/O and includes some very nice workload capture and playback tools as well as reporting tools.

http://labs.vmware.com/flings/io-analyzer

There is a good write-up on some results and experiences here:

http://gabrielchapman.com/?p=82

Jetstress

Jetstress is a benchmarking tool created by Microsoft to simulate Microsoft Exchange workloads. I like the fact that you can configure it to run for very long periods, and it has a more real-world feel about it than just running empty I/Os. You can get the base software here, but you will also need some files from a Microsoft Exchange install DVD (or from an installed instance of Microsoft Exchange). If you cannot get to those files, you cannot complete the startup process inside Jetstress.

You can get Jetstress itself from here:

http://www.microsoft.com/download/en/details.aspx?id=4167 

There is a great how-to guide for Jetstress here:

http://gallery.technet.microsoft.com/Jetstress-Field-Guide-1602d64c

Orion tool

Oracle offer a tool on their website called Orion, which will simulate the workload of an Oracle database.   You can get the tool from here (although you will need to create a free Oracle user account before you can download it).

http://www.oracle.com/technetwork/topics/index-089595.html

SDelete

SDelete is not a benchmarking tool or a performance modelling tool, but it is a great way to generate I/O with very little effort. Just create a new drive in Windows and then run SDelete against it with the -c parameter. This parameter is used for secure deletion, so it generates random patterns (which is real traffic, albeit 100% sequential writes). The syntax is like:

sdelete -c e:\

You can get SDelete from here:

http://technet.microsoft.com/en-us/sysinternals/bb897443 

(updated April 20, 2012 – I found in version 1.6 of SDelete the meaning of the -z and -c parameters got swapped.  In version 1.6 if you want random patterns use -c, if you want zeros use -z.  In previous versions it is the other way around!).
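If you script SDelete runs across machines with mixed versions, the parameter swap described above is easy to trip over. Here is a small Python sketch of a helper that picks the flag. The function name is hypothetical, and it assumes that versions later than 1.6 keep the 1.6 behaviour, which I have not verified, so check your own copy first.

```python
def sdelete_flag(version, pattern):
    """Pick the SDelete parameter for a given fill pattern.

    version: tuple such as (1, 6); pattern: "random" or "zeros".
    Follows the note above: from 1.6, -c gives random patterns and -z gives
    zeros; earlier versions are the other way around.
    """
    if pattern not in ("random", "zeros"):
        raise ValueError("pattern must be 'random' or 'zeros'")
    # Assumption: versions after 1.6 behave like 1.6.
    if version >= (1, 6):
        return "-c" if pattern == "random" else "-z"
    return "-z" if pattern == "random" else "-c"


print("sdelete", sdelete_flag((1, 6), "random"), "e:\\")
```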

—————————————————————————————-

Just doing file copies is probably the worst way to generate benchmarks, especially as a single copy is usually a single threaded operation.

I am sure there are plenty of other tools out there to generate benchmarks and simulate workload.   My main concern with many of them is that synthetic (artificial) workloads do not reflect real world workloads.

So what are your favourite tools and why?

Posted in Brocade, Cisco, DS8800, IBM, IBM Storage, IBM XIV, Storwize V7000, SVC, Tivoli Storage, vmware

What Cisco firmware do I recommend for your MDS Switches?

Right now I am working on giving a client a recommended version of firmware for their Cisco MDS Fibre Channel switches.  For FICON, the recommendations are easy, but for Open Systems there are so many choices.  So what am I going to recommend?

FICON Switches and Directors

For FICON switches, sticking to the FICON (IBM Mainframe Fibre Connection) recommended versions, which are determined by the IBM System z Mainframe team, is a very good strategy. The best place to get these is here (standard IBM logon is required). Just look along the right hand column for the release letters.

The SAN-OS and NX-OS release notes found on the Cisco website also show recommended versions for FICON. For instance, have a look at the FICON recommendations table in the release notes for version 5.2.2a that you can find here. The upgrade path is just below the table I have linked to. This link will get outdated over time (as newer versions come out), but you can list all the release notes here.

If you are using an IBM TS7700 you should also be aware of this page on the IBM Techdocs site.

So based on current versions, if you are running SAN-OS 3.3.1c or below you need to move to 4.2.7b (as per the non-disruptive upgrade path).   I strongly recommend you get to at least version 4.2.7b and start planning to move to release 5.2.2 (provided your hardware supports it).

Open Systems

For open systems attached Fibre Channel switches there are a number of versions to choose from.  There are five things to consider:

  1. Being on the very latest version has a small potential risk (of undiscovered bugs). However, being on very old versions has a greater implicit risk (of being exposed to KNOWN bugs). Just because you have not hit a bug yet does not protect you from potential issues, especially if your SAN is growing.
  2. Your hardware. Some older generation hardware is not supported at higher levels (for example Supervisor-1 cards cannot go past SAN-OS 3.3.5b), while later generation hardware is not supported at lower levels (for example Fabric 3 modules need NX-OS 5.2.2). The Cisco recommended versions page is the best place to confirm this.
  3. End of life.  As SAN-OS reached end of development in 2011, 3.3.5b is the best choice for all hardware that cannot upgrade to NX-OS.   However be aware that some Cisco Generation 1 hardware (such as 2 Gbps capable hardware) will go end of service in September 2012 (for example Supervisor-1 cards and MDS 9120 switches).   Links for this are below.  Of course your service provider may choose to offer support beyond the Cisco end of life date, but instead of updating code, maybe you should be updating hardware.
  4. You need to also upgrade your Fabric Manager to at least the same or a higher version than your switches are running.   One important thing to be aware of is that from version 5.2, Cisco Fabric Manager has been merged into a new product called Cisco Data Center Network Manager (DCNM).
  5. You need to be on at least NX-OS 4.2.7 or 5.04 because these releases introduce the slow drain and congestion detection feature.   This is a must have for every busy SAN.

So what this means is that for open systems, as of April 2012, I recommend you install 3.3.5b for Gen1 hardware, 4.2.7e for Gen2 and Gen3 hardware, and 5.2.2a for Gen4 hardware.
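If you manage a mixed fleet, that recommendation can be captured as a simple lookup. The Python sketch below is illustrative only. The table holds the April 2012 recommendations from this post, the function names are my own, and the SAN-OS versus NX-OS label simply follows the rule that releases up to version 3 are called SAN-OS.

```python
# Recommended open systems releases per hardware generation (April 2012, per this post).
RECOMMENDED_APRIL_2012 = {1: "3.3.5b", 2: "4.2.7e", 3: "4.2.7e", 4: "5.2.2a"}


def os_family(release):
    """Releases up to major version 3 are SAN-OS; later ones are NX-OS."""
    major = int(release.split(".")[0])
    return "SAN-OS" if major <= 3 else "NX-OS"


def recommended_release(generation):
    """Return 'family release' for a Cisco MDS hardware generation (1 to 4)."""
    if generation not in RECOMMENDED_APRIL_2012:
        raise ValueError(f"no recommendation recorded for generation {generation}")
    release = RECOMMENDED_APRIL_2012[generation]
    return f"{os_family(release)} {release}"


for gen in sorted(RECOMMENDED_APRIL_2012):
    print(f"Gen{gen}: {recommended_release(gen)}")
```

Always confirm against the Cisco recommended versions page before acting, since a table like this goes stale as soon as new releases appear.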

For more details on when things are going end of life, check the following websites:

End-of-Sale and End-of-Life for the Cisco MDS 9000 SAN-OS Software Release 3.x
Cisco MDS Director Modules End-of-Life and End-of-Sale Notices
Cisco MDS 9100 End of Life and End of Sale Notices
Cisco MDS 9200 End of Life and End of Sale Notices

Finally, it is well worth bookmarking the following links to help you with any updates (the middle links need a Cisco CCO login):

Cisco Release Notes
Cisco MDS 9000 NX-OS 4.2(x) and SAN-OS 3.3(x) Upgrade and Downgrade Guide
Cisco NX-OS Release 5.0(1a) and SAN-OS 3.3(x) Upgrade and Downgrade Guide
Cisco NX-OS Release 5.2  Upgrade and Downgrade Guide

And thanks to Glen Routley and Filiph Westman for proof reading this post.

Posted in Cisco, IBM Storage, SAN, System z, Uncategorized

How to spot an old IBMer

If you work (or have worked) for IBM then you have probably met many old timers.   IBMers who have been with the company for 25 years or more (or even 50!).

But how do you spot an old IBMer?

Is it by the cut of their suit?   Not sure about that anymore.

An IBM General Systems Division marketing rep in New Jersey in 1978.

It’s certainly not by their extensive beards.

Development of the 3800 printer, taken in the early 1970s by Ray Froess (http://www.froess.com/IBM/3800printer.htm)

Is it by the size of their laptop?  I hope not!

IBM 5100 Portable Computer (1975)

No… you can spot them by their use of certain words and phrases.

Here are a few I can think of… you may know more.   Try this out as a test on someone who you think is an old IBMer and see how they go:

1)  While showing a PowerPoint presentation they keep saying they are showing foils (despite not having seen an overhead projector in over 10 years).

2)  They refer to disk storage as DASD (pronounced Dazz-Dee).

3)  They still call a Sales Rep a Marketing Rep (check out Buck Rodgers’ book The IBM Way to see why).

4)  They refer to their inbox as their reader (see #6 below).

5)  They refer to the IBM corporate personnel database as callup (it has been a Web based application called BluePages for around 15 years).

6)  If you say I will PROFS you (or I will send you a PROFS mail), they don’t bat an eyelid (PROFS was IBM’s Mainframe based mail system, replaced by OfficeVision, which was replaced by Lotus Notes in the 1990s).

7)  If you say you F4ed or PF4ed an email…  they know what you mean (it meant that you deleted it in PROFS/OfficeVision).

8)  They reveal they are a veteran of IBM Typewriters by regaling you with their knowledge of Selectric Rotate Tapes.

9)  They can name the dimensions of a punched card.

10)  You look around the office and they are the only one still wearing a tie.

Go and test it out today.  See if you can find someone who can score 100%.

And have a great weekend…

Posted in IBM

208 day reboot bug

It is ironic that only days after I wrote that 497 is the IT number of the beast, I learn that Linux has another unfortunate number:  208.

The reason for this is a defect in the internal Linux kernel used in recent firmware levels of SVC, Storwize V7000 and Storwize V7000 Unified nodes.  This defect will cause each node to reboot after 208 days of uptime.   This issue exists in unfixed versions of the 6.2 and 6.3 level of firmware, so a large number of users are going to need to take some action on this (except those who are still on a 4.x,  5.x, 6.0 or 6.1 release).   If you have done a code update after June 2011, then you are probably affected.   This means that if you are an IBM client you need to read this alert now and determine how far you are into that 208 day period.   If you are an IBMer or an IBM Business Partner, you need to make sure your clients are aware of this issue, though hopefully they have signed up for IBM My Notifications and have already been notified by e-mail.

In short what needs to happen is that you must:

  1. Determine your current firmware level.
  2. Check the table in the alert to determine if you are affected at all, and if so, how far you are potentially into the 208 day period.
  3. Use the Software Upgrade Test Utility to confirm your actual uptime.
  4. Prior to the 208 day period finishing, either reboot your nodes (one at a time, with a decent interval between them) or install a fixed level of software (as detailed in the alert).

To give you an example of the process, my lab machine is on software version 6.3.0.1 which you can see in the screen capture below.  So when I check the table in the alert, I see that version 6.3.0.1 was made available on January 24, 2012, which means the 208 day period cannot possibly end before August 19, 2012.

SAN Volume Controller and Storwize V7000 Version 6.3

Version number | Release date | Earliest possible date that a system running this release could hit the 208 day reboot
6.3.0.0 | 30 November 2011 | 25 June 2012
6.3.0.1 | 24 January 2012 | 19 August 2012

Regardless, I need to know the uptime of my nodes, so I download the Software Upgrade Test Utility (in case you have an older copy, we need at least version 7.9) and run it using the Upgrade Wizard (NOTE!  We are NOT updating anything here, just checking):

I launch the Upgrade Wizard, use it to upload the tool and follow the prompts to run it, so that I get to see the output of that tool. The output in this example shows the uptime of each node is 56 days, so I have a maximum of 152 days remaining before I have to take any action. At this point I select Cancel. You can run this tool as often as you like to keep checking uptime.
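The date arithmetic above is easy to check for yourself. Here is a minimal Python sketch (the function names are my own) that adds 208 days to a release date and subtracts the reported uptime from the 208 day limit. It is no substitute for the table in the alert or the output of the test utility.

```python
from datetime import date, timedelta

REBOOT_AFTER_DAYS = 208


def earliest_reboot(release_date):
    """Earliest date a node could hit the bug if it booted on the release date."""
    return release_date + timedelta(days=REBOOT_AFTER_DAYS)


def days_remaining(uptime_days):
    """Days left before a node with the given uptime reaches 208 days."""
    return max(REBOOT_AFTER_DAYS - uptime_days, 0)


print(earliest_reboot(date(2011, 11, 30)))  # 6.3.0.0 -> 2012-06-25
print(earliest_reboot(date(2012, 1, 24)))   # 6.3.0.1 -> 2012-08-19
print(days_remaining(56))                   # 152
```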

Note if you are on 6.1 or 6.2 code you may see a timeout error when running the tool, especially for the first time.  If you do see an error, please follow the instructions in the section titled “When running the the upgrade test utility v7.5 or later on Storwize V7000 v6.1 or v6.2” at the Test Utility download site.

As per the Alert:

  • If you are running a 6.0 or 6.1 level of firmware, you are not affected.
  • If you are running a 6.2 level of firmware, the fix level is v6.2.0.5 which is available here for Storwize V7000 and here for SVC.
  • If you are running a 6.3 level of firmware, the fix level is v6.3.0.2 which is available here for Storwize V7000 and here for SVC.
  • If you are using a Storwize V7000 Unified, the fix level is v1.3.0.5 which is available here.

You should keep checking the alert to find out any new details as they come to hand.  If you are curious about Linux and 208 day bugs,  try this Google search.

If you have any questions or need help, please reach out to your IBM support team or leave me a comment or a tweet.

*** April 4:   Updated the blog post with links for all fix levels ***

*** April 10:   The IBM Web Alert has been updated with new information on what to do if your uptime has actually gone past 208 days without a reboot.  In short you still need to take action.  Please read the updated alert and follow the instructions given there. ***

Posted in IBM Storage, Storwize V7000, SVC

Who on earth is Gold Major and why is he sending me emails?

I got a great question recently:

We just updated our Cisco MDS9509s to NX-OS 4.2.7b (from Cisco SAN-OS 3.3.1c) and now we are getting emails from this source:   GOLD-major.

The actual message looks like this:

Time of Event:2012-03-05 15:07:21 GMT+00:00 Message Name:GOLD-major Message Type:diagnostic System Name:xxxxx Contact Name:xxxx@xxx.com Contact Email:xxx@xxx.com Contact Phone:+61-3-xxxx-xxxx Street Address:xx Road, xxxx, VIC, Australia Event Description:RMON_ALERT
 WARNING(4) Falling:iso.3.6.1.2.1.31.1.1.1.10.18366464=2401032512 <= 4680000000:135, 4 Event Owner:ifHCOutOctets.fc4/5@w5c260a03c162
 ThresholdType:FallingThreshold

So who is GOLD-major?

GOLD actually stands for Generic OnLine Diagnostics.  From Cisco’s website:
GOLD verifies that hardware and internal data paths are operating as designed. Boot-time diagnostics, continuous monitoring, and on-demand and scheduled tests are part of the Cisco GOLD feature set. GOLD allows rapid fault isolation and continuous system monitoring.   GOLD was introduced in Cisco NX-OS Release 4.0(1).  GOLD is enabled by default and Cisco do not recommend disabling it.

So in our example GOLD is actually reporting a major event (to do with exceeded thresholds, in this example utilisation on interface fc4/5).
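If you want to pull the interesting fields out of these emails automatically, a regular expression is enough for a message shaped like the one above. The Python sketch below is based purely on that single sample. The pattern, the field names and the idea that a rising alert uses the same layout are my own assumptions, not a documented Cisco format.

```python
import re

# The RMON line from the sample call home message above.
SAMPLE = (
    "WARNING(4) Falling:iso.3.6.1.2.1.31.1.1.1.10.18366464=2401032512 "
    "<= 4680000000:135, 4 Event Owner:ifHCOutOctets.fc4/5@w5c260a03c162"
)

# Assumption: "Rising" alerts follow the same layout as this "Falling" sample.
PATTERN = re.compile(
    r"(?P<direction>Rising|Falling):(?P<oid>[\w.]+)=(?P<value>\d+)\s*"
    r"(?P<op><=|>=)\s*(?P<threshold>\d+).*?"
    r"Event Owner:(?P<counter>\w+)\.(?P<interface>[\w/]+)@"
)


def parse_rmon_alert(text):
    """Return a dict of alert fields, or None if the text does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["value"] = int(fields["value"])
    fields["threshold"] = int(fields["threshold"])
    return fields


alert = parse_rmon_alert(SAMPLE)
print(alert["direction"], alert["counter"], alert["interface"],
      alert["value"], alert["op"], alert["threshold"])
```

For the sample message this reports a Falling event on ifHCOutOctets for interface fc4/5, with the counter value and the threshold it was compared against.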

Most clients using Cisco MDS switches are now moving to NX-OS (over SAN-OS, the name Cisco used for MDS firmware between version 1 and version 3) so this question will become more common. I am working on a post that discusses recommended versions (and the sunsetting of SAN-OS), so expect something soon. If on the other hand you are thinking… how do I set up call home on a Cisco MDS switch? The information for NX-OS is here.

Curiously my brain cannot help itself, when I hear Gold Major I think it means Gold Leader which leads me to Red Leader which leads me to Red October.   Maybe it’s just me?  Enjoy:

Posted in Cisco, IBM Storage, SAN

The IT number of the Beast – doubled!

http://en.wikipedia.org/wiki/M-497_Black_Beetle

Experimental jet-powered locomotive test bed

Last year I blogged about 497 being the IT number of the beast.

Why 497?

Because if a product uses a 32-bit counter to record uptime, and that counter records a tick every 10 msec, then the counter will overflow after approximately 497.1 days. This is because a 32-bit counter can hold 2^32 values, which equals 4,294,967,296 ticks. If a tick is counted every 10 msec, we create 8,640,000 ticks per day (100*60*60*24). So after 497.102696 days, the counter will overflow. What happens next depends on good programming: normally the counter just starts again, but in the worst case a function might stop working or the product might even reboot.

Fortunately we are seeing fewer and fewer of these issues, but just occasionally one still slips out. Recently IBM released details of a 994 day reboot bug in the ESM code of some of their older disk enclosures (EXP100, EXP700 and EXP710). Details about this bug can be found here. What I find interesting is the number of days it takes to occur, since 994 is actually 497 times two. This suggests that this product records a tick every 20 msec. This meant we got past 497 days without an issue but hit a problem after double that number. So if you still have these older storage enclosures, you will need to reboot the ESMs (after checking the alert).
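The overflow arithmetic is simple enough to check in a few lines of Python. This sketch (the function name is my own) computes how many days a counter of a given width lasts at a given tick interval. The 20 msec figure is only my inference from the 994 day number, not something IBM has stated.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def overflow_days(tick_ms, counter_bits=32):
    """Days until a counter of the given width overflows at one tick per tick_ms."""
    ticks_per_day = MS_PER_DAY / tick_ms
    return 2 ** counter_bits / ticks_per_day


print(f"10 msec tick: {overflow_days(10):.6f} days")  # about 497.1
print(f"20 msec tick: {overflow_days(20):.6f} days")  # about 994.2
```

Doubling the tick interval halves the number of ticks per day, which is why the second figure is exactly twice the first.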

I googled 497 to see what images that number brings up and was amazed to find the M-497  jet powered train.   More details on this rather interesting attempt at speeding up the commute home can be found here and here.   It adds a whole new meaning to keeping behind the yellow line.

Posted in Brocade, Cisco, IBM Storage