The third law of the Time Lords

There is a great old saying that the marvellous thing about standards is that there are so many of them… and things don’t get much worse than date and time formats. Are we using DD-MM-YY or MM-DD-YY or DD-MM-YYYY? I could go on for some time. Nothing sets my teeth on edge more than not being certain what date format I am looking at.

What amazes me is how many consumer-oriented applications still don’t respect regional time settings or use a common standard. I recently evaluated CrashPlan, for instance, as an alternative to Dropbox. I was fairly surprised to find that the CrashPlan logs were all in USA date format (even though I am using the CrashPlan Australia portal).

Sadly when I compared this to Dropbox I found exactly the same problem! Not exactly a ‘born-global’ attitude.

The good news is that there is a common standard for date stamps that some IT vendors are now sticking to, which is called ISO 8601. A date stamp in ISO 8601 format always uses YYYY-MM-DD. A great example of a log file that follows this standard is the vmware.log file of any Virtual Machine:

Screen Shot 2013-05-15 at 8.33.38 PM

There is a great write-up of the format and structure for ISO 8601 here:  http://www.cl.cam.ac.uk/~mgk25/iso-time.html
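One practical benefit of YYYY-MM-DD is that the stamps sort correctly even when treated as plain text. Here is a minimal Python sketch of that idea; the three dates are made up for illustration and the MM-DD-YYYY format is just one example of a non-ISO layout.

```python
from datetime import date

# Three illustrative dates, deliberately listed out of order.
dates = [date(2013, 5, 15), date(2012, 12, 1), date(2013, 4, 14)]

# ISO 8601 calendar dates: YYYY-MM-DD.
iso_stamps = [d.isoformat() for d in dates]

# US-style MM-DD-YYYY strings for comparison.
us_stamps = [d.strftime("%m-%d-%Y") for d in dates]

# Sorting the ISO strings as plain text gives chronological order.
iso_sorted = sorted(iso_stamps)

# Sorting the US-style strings as plain text puts December 2012 last.
us_sorted = sorted(us_stamps)

print(iso_sorted)
print(us_sorted)
```

This is why tools like sort and grep behave sensibly on ISO 8601 logs without any date parsing at all.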

So why do I care about date formats (and why should you)? The answer is error logs.

In today’s IT world no system is an island. There is always interaction. When you want to understand root cause you need to understand event order. If you cannot trust the time stamps (because NTP is not in use and the timezones are wrong) and you cannot combine the logs (because the time and date formats are wildly different), then you are in trouble. So next time you are evaluating a solution, scare your vendor and ask if they support ISO 8601. Tell them to get on board.
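To show what combining logs involves, here is a rough Python sketch that merges events from two systems into one ordered timeline. Everything in it is hypothetical: the log lines, the messages, the two stamp formats, and the assumption that system A logs in local time at UTC+10 while system B logs in UTC.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical log entries; stamps and messages are invented for illustration.
system_a = [  # US-style local time, assumed to be UTC+10
    ("05/15/2013 08:27:22 PM", "backup started"),
]
system_b = [  # ISO 8601 in UTC
    ("2013-05-15T10:27:05Z", "path failover"),
    ("2013-05-15T10:29:57Z", "path restored"),
]

LOCAL = timezone(timedelta(hours=10))  # assumed offset for system A


def parse_a(stamp):
    """Parse a US-style local stamp and convert it to UTC."""
    local = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p").replace(tzinfo=LOCAL)
    return local.astimezone(timezone.utc)


def parse_b(stamp):
    """Parse an ISO 8601 UTC stamp with a trailing Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


events = [(parse_a(s), "A", m) for s, m in system_a]
events += [(parse_b(s), "B", m) for s, m in system_b]

# Once every stamp is an aware UTC datetime, ordering is a plain sort.
timeline = [(ts.isoformat(), src, msg) for ts, src, msg in sorted(events)]

for row in timeline:
    print(row)
```

Note that system A needed a custom parser plus an assumed offset before its event could be placed between the two events from system B. If both systems had logged ISO 8601 with an offset, none of that guesswork would be needed.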

This brings us to the three rules of time management (the three laws of the Time Lords):

  1. Always co-ordinate your time.
  2. Always get your timezone correct.
  3. Always use a standard time and date format.

And remember… there is only one person who can break the laws of time:

tumblr_lvplt00WA11r2yv6bo1_500

The second law of the Time Lords

In my last post I reminisced about an airline sending its passengers out early due to poorly coordinated clocks. But even when the clocks are right, there is something else that can let confusion reign: the time zone and its good friend Daylight Saving Time (DST).

In the late 90s, when I worked on mainframe accounts, we would always get an hour at the end of daylight savings to get in some urgent maintenance. Why an hour? Because when we set the clocks back at the end of DST the client would have to shut everything down for that hour.

The reason was that transactions were logged in local time. So if the local time went backwards by one hour, then transactions logged in the subsequent hour would appear to be earlier than transactions logged in the previous hour! This could not be allowed to happen, so the simple solution was to allow no transactions to occur at all.

The solution to this madness was to switch to logging all transactions in GMT (or UTC) and then log a local offset to reflect what the local time was at the time of the transaction. I worked with one New Zealand bank that had set up its systems to believe that local time actually was GMT (instead of +12 from GMT). They had to shut their systems down for a total of 12 hours (across multiple weekends) to correct this sin. Painful.
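The problem and the fix are easy to reproduce. The Python sketch below is only an illustration: the two transactions are invented, and I have used fixed offsets for Sydney (UTC+11 during DST, UTC+10 afterwards) around the end of DST on 7 April 2013 rather than a real time zone database.

```python
from datetime import datetime, timedelta, timezone

AEDT = timezone(timedelta(hours=11))  # Sydney during DST (assumed fixed offset)
AEST = timezone(timedelta(hours=10))  # Sydney after DST ends (assumed fixed offset)

# Two hypothetical transactions, 30 real minutes apart, either side of the
# moment the clocks go back from 03:00 to 02:00.
first = datetime(2013, 4, 7, 2, 50, tzinfo=AEDT)
second = (first + timedelta(minutes=30)).astimezone(AEST)

# Logged as bare local time, the later transaction looks earlier.
local_first = first.strftime("%H:%M")
local_second = second.strftime("%H:%M")
looks_out_of_order = local_second < local_first


def log_record(ts):
    """Return (UTC stamp, local offset) for an aware datetime."""
    utc = ts.astimezone(timezone.utc)
    return (utc.strftime("%Y-%m-%dT%H:%M:%SZ"), ts.strftime("%z"))


# Logged as UTC plus the local offset, the order is preserved and the
# local wall-clock time can still be recovered.
rec_first = log_record(first)
rec_second = log_record(second)

print(local_first, local_second, looks_out_of_order)
print(rec_first, rec_second)
```

The UTC stamps only ever move forward, so nothing needs to be shut down for the repeated hour, while the stored offset still tells you what the clock on the wall said.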

It’s bad enough living in Australia where in Summer we have five time zones (Western, Central, Central DST, Eastern, Eastern DST).

Aust TZ

Compare this to China (1.3 billion people in a nation 4893km east to west) or India (1.2 billion in a nation 2933km East to West)  where both nations use single time zones.   Australia in comparison is a nation of 22 million with up to 5 time zones (though admittedly it is also quite wide (east to west), at 4100km).  I wonder if anyone has ever written a PhD paper on the economic benefit of a single timezone?

Not that I am planning to urge Australians to abandon their resistance to universal daylight savings, but when you have worked in IT support this is yet another variable you need to struggle with, especially as our governments keep monkeying with the start and stop dates (something that has caused all manner of unforeseen issues). My one piece of advice is to not only use NTP, but to ensure the timezone is set correctly on all of your IT infrastructure. I have talked about the Brocade Logon panel of shame in the past, but the timezone of shame is close behind. This switch is not in England and yet its timezone is UTC.

2013-05-08_19-36-21

For many products, timezone changes only take effect after a reboot, so it’s vital to get them set at install time.   You don’t want to have an issue, then during investigation have that issue compounded by confusion about date stamps due to mixed time zones.

And if you are looking for help with time zones and you are a Mac or iPhone user, I can strongly recommend Time Scroller, an app I use on a very regular basis.

Recommendations for other apps are very welcome.

The first law of the Time Lords

Let me share a war story with you (one that I swear is true).

In the year 2000 an Ansett passenger jet, loaded with passengers, was preparing to push back from the terminal in Sydney.   But there was a timing problem…  the control tower insisted they were trying to depart too early…  three minutes too early.

What resulted was a dispute about the actual time, which concluded with the control tower politely suggesting Ansett ring 1194 (the phone number in Australia to get the talking clock).   One phone call later and Ansett could see the problem:   Their mainframe was 3 minutes fast.   Seriously!

The root cause of this problem was simple: while Ansett used sysplex timers (Mainframe timing devices) to co-ordinate time between their mainframes, nothing had been set up to co-ordinate the sysplex timer time with an external source. And sadly the time had drifted.

IBM Sysplex Timer

The solution was simple. We attached a modem to the sysplex timer and it began calling NIST in Boulder, Colorado to check for drift (this product had no concept of NTP). Meanwhile the Sysplex Timer had to very slowly (and it took a while) drift its reported time back to reality.
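To get a feel for why steering a clock back takes a while, here is a small back-of-the-envelope calculation in Python. The slew rate is purely an assumption: I have borrowed 500 parts per million, the figure commonly quoted as the maximum slew rate for ntpd, because I do not know what rate the Sysplex Timer actually used.

```python
def slew_duration_hours(offset_seconds, slew_ppm):
    """Hours needed to remove an offset by slewing at slew_ppm parts per million."""
    # At N ppm the clock gains or loses N microseconds per real second.
    seconds_needed = offset_seconds * 1_000_000 / slew_ppm
    return seconds_needed / 3600


# A 3 minute error corrected at an assumed 500 ppm.
hours = slew_duration_hours(180, 500)
print(hours, "hours, or about", round(hours / 24, 1), "days")
```

Under that assumption a three minute error takes roughly four days to remove, which is the price of never letting the clock jump.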

So Ansett had broken the first law of the Time Lords:  Always sync your clocks.   Pretty well every product in your data center can sync with NTP and there is no excuse not to use it.

So given my very strong views on this,  I am curious, have I missed something?   Are there reasons to NOT use NTP?   Have you seen objections?   Please share your thoughts.

And Ansett?  Sadly no longer with us.  The airline failed in September 2001, their planes sold off or broken down for scrap.   A sad day for Australian aviation and all the people who worked for that fine airline.

IBM Scripting Tools for SVC and Storwize has been updated

The Rebalance script for IBM SVC has been updated.  This is the first update I have seen since 2010.     This release of the SVCTools package will now work on the Storwize family of products without modification and it can rebalance Easy Tier managed disk groups.

Why use this?   Normally in SVC (and always with Storwize products), each MDisk is a different RAID array.   In those cases, when you add MDisks (arrays) to a pool (MDiskGrp) then you are adding extra spindles to that pool.   By rebalancing extents from existing volumes onto new MDisks, existing volumes will almost always get a performance boost.   It also means that you free up space on the older MDisks so that when you later create new volumes in that pool, they will get a chance to use extents across a wider range of MDisks (old and new, not just new).  This is particularly useful if the pool was full when you added extra space.
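To illustrate the idea of rebalancing (and only the idea), here is a simplified Python sketch. It is not how balance.pl works internally: the function name, the MDisk names and the extent counts are all hypothetical, and it assumes every MDisk in the pool is the same size and simply evens out the number of used extents.

```python
def plan_rebalance(used_extents):
    """Return (source, target, count) moves that even out used extents per MDisk."""
    names = sorted(used_extents)
    base, extra = divmod(sum(used_extents.values()), len(names))
    # The first `extra` MDisks take one extra extent so the totals still match.
    target = {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}

    surplus = {n: used_extents[n] - target[n] for n in names if used_extents[n] > target[n]}
    deficit = {n: target[n] - used_extents[n] for n in names if used_extents[n] < target[n]}

    moves = []
    for src in sorted(surplus):
        for dst in sorted(deficit):
            if surplus[src] == 0:
                break
            count = min(surplus[src], deficit[dst])
            if count:
                moves.append((src, dst, count))
                surplus[src] -= count
                deficit[dst] -= count
    return moves


# A hypothetical pool: two full MDisks and two newly added empty ones.
pool = {"mdisk0": 400, "mdisk1": 400, "mdisk2": 0, "mdisk3": 0}
moves = plan_rebalance(pool)
print(moves)
```

In this made-up pool half of the extents on each old MDisk move to a new one, so existing volumes end up spread across twice as many arrays and the old MDisks regain free space for future volumes.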

Of course if you are using Easy Tier, then the hardest working extents will already be on SSD, so rebalancing may not be as beneficial.   Note that Easy Tier will be disabled while the balance.pl script is running and extents on SSD (presumably put there by Easy Tier) will not be rebalanced or moved.  Also if your backend disk is already wide-striped (like the IBM XIV) then rebalancing is not necessary.

For more info check out the scripting community here:   Community URL
Download the latest version of the package here:  Download URL

Murphy’s Law in Action

A friend of mine recently had a problem while installing a new SAN24B-4 (an IBM 2498B24, which is a rebadged Brocade 300 Fibre Channel switch).

The problem was simple:   It was dead.

its-dead-jim

The switch would not power up.  Swapping the power cord and outlet made no difference.
He opened a service call and got a callback from IBM support.   Their suggestion?

Loosen the 4 screws holding the mounting rails.

This seemed like a voodoo fix, but the screws were loosened and voila!    The switch was miraculously brought to life.

So what’s the deal?

The heart of the problem is the Brocade rail kits shipped by IBM. While they have the simple advantage that they can be installed into a great variety of racks, they are in practice quite awful. Check out all the parts (hope you brought your screwdriver!):

2013-04-14_12-43-05

It turns out they come with a huge selection of screws to attach the rails to the switch.   You have to choose the really short ones (they are 3/16″ long and are #6 in the diagram above).     The reason this causes a problem is that a longer screw may reach far enough into the case to potentially short out the coil on the power supply.   While this is clearly documented in the install guide (and with labels on the switch), this kind of crazy trap for the unwary is quite annoying.

Murphy’s Law in this case is simple: If it can be done wrong, someone will eventually do it wrong.

Given how amazingly easy the IBM server and storage rail kits are to install, it bamboozles me why these SAN switch rail kits are stuck in the 1970s.

Comments and war stories welcome.

Enhancement requests for IBM SVC and Storwize V7000

If you have ever wanted to suggest ways to enhance IBM products you should check out the IBM RFE Page here:     http://www.ibm.com/developerworks/rfe/

I was blissfully unaware of this page, so I found it rather fascinating to check out some of the ideas that people were suggesting. While the page is clearly IBM Software focused, you can in fact suggest improvements to IBM’s SVC and Storwize V7000 products. Just choose the Tivoli Brand and then the Storage Product family.

RFE Site

To use the site you will need an IBM ID which you can create quite easily here.

Some good examples of RFEs you could vote for include:

25498     Disk drive upgrade utility for Storwize V7000
26193     Add SCSI Unmap primitive

Search for RFEs here.

For further reading about the site check out Scott Laningham’s blog here (which contains a tutorial on how to submit requests) and Daryl Pereira’s blog here (which has a presentation on the RFE Community).

Performance monitoring – what is the least you expect?

If there is one part of Enterprise Storage where product delivery sometimes falls down, it is performance monitoring. It appears to be standard practice across most of the major vendors to offer separate, highly priced Storage Resource Management products if you want to get quality performance data from your Enterprise Storage kit. When I was at IBM we struggled when we struck SAN performance issues that needed quality data if the client did not have IBM Tivoli Storage Productivity Center (TPC). Even worse was that we also struggled to sell TPC if the client had not purchased it with the kit itself. Many clients felt the list price didn’t match the perceived benefit. And to be frank this is not just an IBM problem. I routinely meet clients and read tender requests (including those with EMC kit and HP kit) that lament their current lack of storage performance monitoring tools.

One solution for a vendor is to add a Performance monitoring tool in their product’s management GUI. An example that I really liked was the Performance monitor added to the IBM SVC and Storwize V7000. However, while this was a huge step forward, these sorts of tools suffer from what I call the Dory effect (named after the fish in Finding Nemo who had short-term memory issues) since you cannot see anything older than the most recent five minutes. One simple reason for this short-term memory is that to retain long-term data you need an easily searchable database with plenty of storage.

The IBM XIV on the other hand has set a far better precedent in this regard. It retains 30 days of detailed stats and one year of averaged stats internally, with multiple ways of accessing that data via both GUI and CLI.

One further innovation that the XIV has brought to the table is an iPhone and iPad Mobile Dashboard. Version 1.2 has recently been released and is another major step forward. It is a universal App, meaning the same app now installs on both iPhone and iPad, but the real improvement is that you can now monitor up to 20 XIVs from one dashboard with a very data-rich GUI. Plus you can now drill down to check system events as well.

In this screen capture you can see stats for one demo XIV, with volumes in the center and hosts to the right.  You can drill down and change the focus between bandwidth, IOPS and latency.

Performance Monitor

In this screen capture you can see hardware events for a demo system.   Amusingly while demo mode does not show some of the crazy errors you get from development lab machines (which demo mode in the XIV GUI sometimes showed in earlier versions), they have been sanitised to the point that the simulated errors are all just mock ups of what real errors would be (I feel sorry for the developers, they cannot win either way).

XIV Hardware Events

You can get the free app from the Apple app store here and as usual you can run the app in demo mode, meaning you can check out this tool without owning an XIV.   It really is a great app.   Why not install it and show it to your EMC or HP (or IBM) rep and say… this is what I want for every product.

Frankly I think this open attitude to performance and system monitoring is the least you should expect.

 
