Knowledge sharing in IT

Hopefully most of you know what TED Talks are: a truly marvelous collection of inspiring videos, usually 20 minutes or less, that are nearly always worth watching.

I recently watched one from Stanley McChrystal, a former US Army General, and it gave me a unique view on information sharing. At one point General McChrystal says something that strikes me as phenomenally appropriate to IT (even though he was talking about military secrets).  He says:

“… as we passed that information around, suddenly you find that information is only of value if you give it to people who have the ability to do something with it. The fact that I know something has zero value if I’m not the person who can actually make something better because of it.”

So how does this apply to IT?

I have worked with support personnel who kept all of their secret commands in notepads concealed in their back pockets.  Luckily, in many cases the UNIX history command let me learn all their secret incantations as soon as they were out of the room.  I did work with one guy in remote support who would create a file with vi, populate it with his commands of power, make it executable and run it.  He left nothing in the UNIX history but vi, chmod 755 and the name of his secret file.  He was simultaneously a smart guy and a smart alec.


I have learnt that the motivation to keep commands secret often does not spring from any misguided belief that dangerous commands must be kept away from inexperienced people.  These people are simply trying to make themselves indispensable.  Sadly, in the meantime everyone else is left to reinvent the wheel, or wait for the right person to come online, relearn things that others already knew, and repeat mistakes that others have already made.

This leads me to knowledge sharing.

There are three forms of knowledge sharing in the IT industry:

  1. Knowledge that vendors share with users
  2. Knowledge that users share with users
  3. Knowledge that users share with vendors

Now you may think that vendors sharing knowledge with users is obvious, but three things stand in the way:

  • Portal walls.  Vendors who guard their knowledge bases behind portal walls are protecting their intellectual property from freeloaders and their inquisitive competition, while simultaneously forcing everyone to rely on their willingness to actually share, and denying us googleability.
  • Poor sharing practices, such as readme documents that are vague or incomplete (or even non-existent).
  • Fear of other vendors' marketing departments.  This fear drives IT companies to withhold information, not out of fear of what their users will say, but out of fear of what their competitors will do with it.

The good news is that each and every one of us has information we can share with each other.  Whether we do this in blogs, on social media platforms (like forums or Twitter) or just in handwritten notes stuck up on the notice board, it is in everybody's hands to share what they know.  Find a forum, start a blog, send out emails.  Just do it.

And if your IT vendors are worth their salt, they will listen in.

And always remember:

…the fact that I know something has zero value if I’m not the person who can actually make something better because of it.


Posted in Uncategorized | 2 Comments

Don’t always default to default

I once sat in a project meeting in which the Project Manager declared that:

Default settings are always the best settings, since they were the ones the vendor made default!

While you may think there is some logic in the statement, it is a flawed belief.

While it’s true that in many cases the default settings cover the most common implementations, there is no guarantee of safety in leaving everything at its defaults.
Equally, there is great danger in monkeying with all the bells and whistles if you are not sure what they will do!

A classic example I keep seeing is AIX Fibre Channel HBA settings, in particular for error recovery and dynamic tracking.

AIX was in existence a long time before Fibre Channel came into common use.  I/O in those days normally travelled down a single path to a single device, or via a common SCSI cable off which hung multiple devices, like very large hard drives (well, physically large but logically small).  So if there was a glitch on the link, it was better to wait a while for the link to come back than to declare the link dead, since there was no other way to get to those devices.

However once multipath Fibre Channel became common, it made sense to allow more control over this behaviour.

AIX has two settings that affect how link failures are handled (caused by an HBA failure, switch port failure, cable failure, someone disconnecting the wrong cable, etc).
Fast Failure of an I/O path is controlled by a fscsi device attribute called fc_err_recov.
The default setting for this attribute is delayed_fail (which I call slow failure).  You can instead set it to fast_fail.  This setting influences what happens when the adapter driver receives a message from the fibre channel switch that there has been a link event.

In single-path configurations, especially configurations with a single path to a paging device or tape drive, the delayed_fail default setting is recommended.
So paths to tape drives or to paging devices should use delayed_fail, while paths to everything else should use fast_fail.
With AIX, regardless of what multi-pathing software is in use, if a path fails there will most likely be a pause in I/O processing.  At the time of the path failure, some I/O has already been issued to the ‘bad’ path.  With fast_fail, after 15 seconds the path is failed and that I/O is resent down a different path.  With delayed_fail, this pause can be as long as 40 seconds.

What should you look for?   This is the default (normally less ideal) situation:

# lsattr -E -l fscsi0
attach        switch       How this adapter is CONNECTED         False
dyntrk        no           Dynamic Tracking of FC Devices        True
fc_err_recov  delayed_fail FC Fabric Event Error RECOVERY Policy True
scsi_id       0x630f00     Adapter SCSI ID                       False
sw_fc_class   3            FC Class for Fabric                   True

These are my recommended settings:

# lsattr -El fscsi0 
attach        switch    How this adapter is CONNECTED         False
dyntrk        yes       Dynamic Tracking of FC Devices        True
fc_err_recov  fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id       0x630f00  Adapter SCSI ID                       False
sw_fc_class   3         FC Class for Fabric                   True

The ‘True’ at the end of the line means the value can be changed, but not necessarily while the device is in use.
So if you find you are not running the ideal settings and it makes sense to change them, run these two commands against each relevant fscsi device.  The -P flag means only the ODM is changed, so you can reboot at your leisure (unless you can unmount the affected file systems and vary off the affected VGs, in which case the change can take effect without a reboot).

chdev -l fscsi0 -a dyntrk=yes -P
chdev -l fscsi0 -a fc_err_recov=fast_fail -P

Note we are also setting dynamic tracking to yes.  This setting allows AIX to detect on the fly that the fibre channel port ID of a device has changed.  This is handy if you need to move a cable to a different port or switch (where you are zoning by WWPN and have a need to reconfigure on the fly).
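If you have more than a couple of adapters, you can generate the pair of chdev commands for every fscsi device in one go.  This is only a sketch: the adapter list is hard-coded with placeholder names, and the lsdev pipeline in the comment is just one possible way to build it on a real AIX system.  The loop prints the commands rather than running them, so you can review before piping the output to sh:

```shell
# Placeholder adapter list; on a real AIX box you might build it with
# something like:  adapters=$(lsdev -C | awk '/^fscsi/ {print $1}')
adapters="fscsi0 fscsi1"

# Print (rather than run) the chdev commands so they can be reviewed first
for a in $adapters; do
  echo "chdev -l $a -a dyntrk=yes -P"
  echo "chdev -l $a -a fc_err_recov=fast_fail -P"
done
```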

The readme for AIX 5.2 (which applies equally to higher versions) explains all of this behaviour here:

http://www-1.ibm.com/support/docview.wss?uid=isg1520readmefb4520desr_lpp_bos

Posted in advice | Tagged , | 1 Comment

Thin Provisioning Buyers Guide

Storage space consumption is a major bone of contention in every data center.  It seems 100 TB of new storage can fill up in the blink of an eye, and then you have to buy some more.  But what to do?  Let’s get under the covers to see what is happening.

When data is written to a volume (I am tempted to say disk, but since most disks are really virtual volumes that may never touch a spinning disk, I will stick with volume), it is written by a file system (or a disk space manager of some kind, like ASM) to logical block addresses, or LBAs, which are traditionally 512 bytes in size.  Space in a volume is addressed in LBAs starting at zero and running to the highest address the volume size allows (so clearly a 5 TB volume has way more LBAs than a 5 GB volume).
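To put some numbers on that, here is the arithmetic (using binary units, i.e. treating 1 GB as 1024^3 bytes, and the traditional 512-byte LBA):

```shell
bytes_per_lba=512
gib=$((1024 * 1024 * 1024))

# LBA counts for a 5 GB volume and a 5 TB volume
five_gb=$((5 * gib))
five_tb=$((5 * 1024 * gib))
echo "5 GB volume: $((five_gb / bytes_per_lba)) LBAs"   # about 10.5 million
echo "5 TB volume: $((five_tb / bytes_per_lba)) LBAs"   # about 10.7 billion
```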

From the host server’s perspective, if a volume claims it has 5 TB of space available, then the server believes it has the right to write 5 TB.  It is quite common for storage controllers to allow storage administrators to over-allocate space, meaning that for a fixed quantity of physical capacity (say 75 TB) you could allocate 150 TB of volumes.  This over-allocation is only made possible by thin provisioning, sometimes combined with other space-saving methods (like compression and deduplication).  Normally over-allocation occurs by creating over-allocated storage pools.

An over-allocated storage pool means the administrator can create virtual volumes whose sizes, when summed together, exceed the available storage capacity of that pool.  In other words, we can advertise more space than we actually have.  This means the volumes in the pool had better be space efficient in nature.
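The degree of over-allocation is easy to express as a percentage (75 TB physical and 150 TB allocated are just the example figures from above):

```shell
physical_tb=75
allocated_tb=150

# Integer percentage keeps the arithmetic shell-friendly
ratio_pct=$((allocated_tb * 100 / physical_tb))
echo "Pool is ${ratio_pct}% allocated"   # anything over 100% is over-allocated
```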

Now genuine space-efficient volume design should follow five principles:

  1. When data gets written to the volume, allocate as little space from the pool as possible to hold that data.  In other words, if I write 100 KB to a volume, don’t allocate 100 GB from the pool to that volume to hold that data.
  2. When zeros get written to the volume, allocate no space from the pool and preferably release the space occupied by those LBAs back to the pool.  In other words, if I write 1 MB of zeros, don’t allocate 1 MB of pool space to hold those zeros.  In fact, have a look at the LBAs I am writing to and if they include address ranges already allocated to the volume from the pool, see if we could de-allocate them from the volume and return that space to the pool.
  3. When allocated space is no longer needed, offer some way to release that space back to the pool (sounds like # 2 but is actually different).  In other words, if I delete a 1 GB file, then that’s really 1 GB of volume space I don’t need anymore.   The file system knows this, but does the underlying disk controller?
  4. If space is running short in the pool, give me plenty of warning so I can do something about it before everything goes wrong.
  5. If data is now being written in a thin fashion, then it is likely the data is not being written sequentially.  When combined with other space-saving technologies this should ideally not create performance issues.

So how well does your storage system do in this regard?  Over the next few posts I will explore these categories in greater depth.  If you have any other characteristics I have missed, I am happy to add them.

Posted in advice, Uncategorized | Tagged , , , | 2 Comments

Don’t look back in anger

It seems fairly obvious that as you get older, you have more and more memories to look back on.  Some of these memories are happy…  some less so.  But seen through the golden haze of nostalgia, many things that happened in the past start to seem far more glorious than they really were.

I was born and grew up in Perth, so my childhood memories are all from that city.  Recently I found a Facebook page called Lost Perth, clearly run by someone close to my age, as the photos being posted really appeal to my sense of nostalgia.  Recently they posted a photo of Perth International Airport as it was back in the 1960s.  I can remember being in this very hall, and it was a place of wonders, where people arrived from far, far away.  It seemed so amazing to my child’s mind.

Old Perth Airport

Someone then immediately posted another photo of the same place.   Can you spot the problem?

Packed Airport

Why didn’t my childhood memories contain images of the arrivals hall as an arrivals hell? Maybe I didn’t want to remember it that way?

It’s a bit like your memories of life at former employers.   You can leave a company in anger, blaming terrible management or misguided market dominance plans or crazed short-term thinking…. but it won’t do you any good.   Choose instead to remember the golden years…  and don’t look back in anger.

Posted in Uncategorized | 2 Comments

Your DS3500 needs new firmware to support T10-PI

For those of you who use the IBM DS3500 (a midrange storage controller), you should ensure all your machines are on firmware release 07.86.32.00 or higher, since this release adds support for T10-PI.  New additional or replacement disk drives may require that support, and inserting high-level drives into down-level machines can result in a failed drive replacement or unexpected errors.  Ideally you should not be upgrading your machines while there is a failed component, so I recommend you pro-actively upgrade your DS3500s, particularly if you are ordering new drives or additional enclosures.

Note that IBM recommend 07.86.39.00 on this page.  New firmware can be downloaded from here.

If you are wondering what on earth T10-PI is, check out this blog here.  If you use AIX there is also a short write-up here.  It does not mention the DS3500, but I think that is due to the age of the post.

You can tell that T10-PI support is enabled for an array very easily in the upgraded GUI.

T10-PI Screen Cap

Posted in Uncategorized | 1 Comment

Innovate, emulate or evaporate.

AngryDinosaur

The IT Industry is changing rapidly.   New disruptive technologies are changing the whole playing field and vendors who just talk about backup are going the way of the dinosaur.   Actifio saw this more than four years ago and began a new era of Copy Data Management. Finally the other guys are starting to realize the ground has shifted below their feet and have begun talking about doing exactly the same thing (without actually changing anything that they currently do).

But don’t just listen to me: have a read of Chris Mellor’s analysis and watch the EMC video.  Then talk to Actifio and get today what EMC cannot deliver tomorrow.

http://www.theregister.co.uk/2013/11/28/emc_blurts_backup_is_broken/

Posted in Actifio | Tagged , , , , | Leave a comment

Backblaze Blog » How long do disk drives last?

This is fascinating stuff that pretty well exactly matches my experience with almost any IT product.  Congrats to Backblaze for collecting and sharing this information.

Backblaze Blog » How long do disk drives last?.

Posted in advice | Leave a comment