Don’t always default to default

I once sat in a project meeting in which the Project Manager declared that:

Default settings are always the best settings, since they were the ones the vendor made default!

While you may think there is some logic in the statement, it is a flawed belief.

While it’s true that in many cases the default settings cover the most common implementations, there is no guarantee of safety in leaving everything at defaults.
Equally, there is great danger in monkeying with all the bells and whistles if you are not sure what they will do!

A classic example I keep seeing is AIX Fibre Channel HBA settings, in particular for error recovery and dynamic tracking.

AIX was in existence a long time before Fibre Channel came into common use. I/O in those days normally travelled down a single path to a single device, or over a shared SCSI cable off which hung multiple devices such as very large hard drives (well, physically large but logically small). So if there was a glitch on the link, it was better to wait a while for the link to come back than to declare it dead, since there was no other way to get to those devices.

However, once multipath Fibre Channel became common, it made sense to allow more control over this behaviour.

AIX has two fscsi device attributes that affect how link failures are handled (whether caused by an HBA failure, switch port failure, cable failure, someone disconnecting the wrong cable, and so on): fc_err_recov and dyntrk.
Fast failure of an I/O path is controlled by fc_err_recov.
The default setting for this attribute is delayed_fail (which I call slow failure). You can instead set it to fast_fail. This setting influences what happens when the adapter driver receives a message from the Fibre Channel switch that there has been a link event.

In single-path configurations, especially those with a single path to a paging device or tape drive, the delayed_fail default is recommended.
So paths to tape drives or to paging devices should use delayed_fail, while paths to everything else should use fast_fail.
With AIX, regardless of what multipathing software is in use, a path failure will most likely cause a pause in I/O processing. At the time of the failure, some I/O has already been issued down the ‘bad’ path. With fast_fail, the path is failed after about 15 seconds and that I/O is resent down a different path. With delayed_fail, this pause can be as long as 40 seconds.
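
The rule of thumb above can be written down as a tiny shell function. This is only a sketch of my reading of it: the name recommend_policy and the labels tape, paging and disk are made up for illustration, and treating any single-path device as a delayed_fail candidate is an assumption drawn from the single-path remark above.

```shell
# recommend_policy KIND PATHS
#   KIND  - what sits behind the adapter: tape, paging or disk
#           (hypothetical labels, not AIX device classes)
#   PATHS - number of paths to the device
# Prints the fc_err_recov value suggested by the rule of thumb.
recommend_policy() {
    kind=$1
    paths=$2
    if [ "$kind" = "tape" ] || [ "$kind" = "paging" ] || [ "$paths" -le 1 ]; then
        # Nowhere else to send the I/O, so waiting for the link is safer.
        echo delayed_fail
    else
        # Another path exists, so fail quickly and retry elsewhere.
        echo fast_fail
    fi
}

# Example: recommend_policy disk 2
```

It changes nothing on the system; it just prints a value you can compare with what lsattr reports.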

What should you look for?   This is the default (normally less ideal) situation:

# lsattr -E -l fscsi0
attach        switch       How this adapter is CONNECTED         False
dyntrk        no           Dynamic Tracking of FC Devices        True
fc_err_recov  delayed_fail FC Fabric Event Error RECOVERY Policy True
scsi_id       0x630f00     Adapter SCSI ID                       False
sw_fc_class   3            FC Class for Fabric                   True

These are my recommended settings:

# lsattr -El fscsi0 
attach        switch    How this adapter is CONNECTED         False
dyntrk        yes       Dynamic Tracking of FC Devices        True
fc_err_recov  fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id       0x630f00  Adapter SCSI ID                       False
sw_fc_class   3         FC Class for Fabric                   True
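
If you have more than a couple of adapters, eyeballing this output gets tedious. Here is a minimal sketch of a filter that reads lsattr output on standard input and says whether the two attributes match my recommended values. The function name check_fscsi is hypothetical (it is not an AIX command), and I am assuming the attribute name and value are the first two columns, as in the listings above.

```shell
# check_fscsi: read "lsattr -El fscsiN" output on stdin and report
# whether dyntrk and fc_err_recov match the recommended values.
# Prints "OK" if both match, otherwise "CHECK ..." with the values seen.
check_fscsi() {
    awk '
        $1 == "dyntrk"       { dyn = $2 }
        $1 == "fc_err_recov" { err = $2 }
        END {
            if (dyn == "yes" && err == "fast_fail") {
                print "OK"
            } else {
                printf "CHECK dyntrk=%s fc_err_recov=%s\n", dyn, err
            }
        }
    '
}

# On an AIX host this could be driven as (untested sketch):
#   lsattr -El fscsi0 | check_fscsi
```

Remember that a CHECK result is not automatically wrong: adapters serving tape drives or paging devices are meant to stay on delayed_fail.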

The ‘True’ at the end of a line means the value can be changed, but not necessarily while the device is in use.
So if you find you are not running with the ideal settings and it makes sense to change them, run these two commands against each relevant fscsi device and then reboot at your leisure. The -P flag changes only the ODM, so the new values take effect when the device is next configured, normally at reboot (unless you can unmount the affected file systems and vary off the affected VGs, in which case a reboot may not be needed).

chdev -l fscsi0 -a dyntrk=yes -P
chdev -l fscsi0 -a fc_err_recov=fast_fail -P
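
With several adapters it can help to generate the commands rather than type them. The sketch below is a dry run: it reads device names on standard input and only prints the chdev lines, so you can review them before running anything. The name gen_chdev is hypothetical, and how you produce the list of names (for instance by filtering lsdev output) is left to you. Leave out any adapter that serves tape drives or paging devices.

```shell
# gen_chdev: read device names on stdin, one per line, and print the
# two chdev commands for every name that looks like fscsiN.
# Nothing is executed; other names (e.g. fcs0) are silently skipped.
gen_chdev() {
    while read -r dev; do
        case "$dev" in
            fscsi[0-9]*)
                echo "chdev -l $dev -a dyntrk=yes -P"
                echo "chdev -l $dev -a fc_err_recov=fast_fail -P"
                ;;
        esac
    done
}

# Example: printf 'fscsi0\nfscsi1\n' | gen_chdev
```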

Note that we are also setting dynamic tracking to yes. This setting allows AIX to learn on the fly that the Fibre Channel port ID of a device has changed. This is handy if you need to move a cable to a different port or switch (provided you are zoning by WWPN) and have to reconfigure on the fly.

The readme for AIX 5.2 (which applies equally to higher versions) explains all of this behaviour here:

http://www-1.ibm.com/support/docview.wss?uid=isg1520readmefb4520desr_lpp_bos

About Anthony Vandewerdt

I am an IT Professional who lives and works in Melbourne Australia. This blog is totally my own work. It does not represent the views of any corporation. Constructive and useful comments are very very welcome.