I once sat in a project meeting in which the Project Manager declared:
“Default settings are always the best settings, since they were the ones the vendor made default!”
You may think there is some logic in that statement, but it is a flawed belief.
While it’s true that the default settings often cover the most common implementations, there is no guarantee of safety in leaving everything at defaults.
Equally, there is great danger in monkeying with all the bells and whistles if you are not sure what they will do!
A classic example I keep seeing is AIX Fibre Channel HBA settings, in particular for error recovery and dynamic tracking.
AIX was in existence a long time before Fibre Channel came into common use. I/O in those days normally travelled down a single path to a single device, or via a common SCSI cable off which hung multiple devices such as very large hard drives (well, physically large but logically small). So if there was a glitch on the link, it was better to wait a while for the link to come back than to declare the link dead, since there was no other way to get to those devices.
However, once multipath Fibre Channel became common, it made sense to allow more control over this behaviour.
AIX has two settings that affect how link failures (caused by an HBA failure, switch port failure, cable failure, someone disconnecting the wrong cable, etc.) are handled.
Fast failure of an I/O path is controlled by an fscsi device attribute called fc_err_recov.
The default setting for this attribute is delayed_fail (which I call slow failure). You can instead set it to fast_fail. This setting influences what happens when the adapter driver receives a message from the Fibre Channel switch that there has been a link event.
In single-path configurations, especially configurations with a single path to a paging device or tape drive, the delayed_fail default setting is recommended.
So paths to tape drives or to paging devices should use delayed_fail, while paths to everything else should use fast_fail.
With AIX, regardless of what multipathing software is in use, if a path fails there will most likely be a pause in I/O processing. What happens is that at the time of the path failure, some I/O has already been issued to the ‘bad’ path. After 15 seconds the path is failed and that I/O is resent down a different path. With delayed_fail, this pause can be as long as 40 seconds.
What should you look for? This is the default (normally less ideal) situation:
# lsattr -E -l fscsi0
attach       switch       How this adapter is CONNECTED         False
dyntrk       no           Dynamic Tracking of FC Devices        True
fc_err_recov delayed_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0x630f00     Adapter SCSI ID                       False
sw_fc_class  3            FC Class for Fabric                   True
These are my recommended settings:
# lsattr -El fscsi0
attach       switch       How this adapter is CONNECTED         False
dyntrk       yes          Dynamic Tracking of FC Devices        True
fc_err_recov fast_fail    FC Fabric Event Error RECOVERY Policy True
scsi_id      0x630f00     Adapter SCSI ID                       False
sw_fc_class  3            FC Class for Fabric                   True
The ‘True’ at the end of the line means the value can be changed, but not necessarily while the device is in use.
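To check several adapters at once rather than reading each listing by eye, something like the following sketch may help. The function name check_fscsi_attrs is my own invention for illustration; it simply reads lsattr -El style output on standard input and prints any attribute that differs from the values recommended above. The loop at the bottom assumes that lsdev -C -F name lists your fscsi devices, and it is guarded so it only runs on AIX.

```shell
#!/bin/sh
# check_fscsi_attrs (hypothetical helper): reads `lsattr -El <device>` output
# on stdin and prints one line for each attribute that is not at the
# recommended value. Prints nothing if both attributes are as recommended.
check_fscsi_attrs() {
    awk '
        $1 == "dyntrk"       && $2 != "yes"       { print $1 "=" $2 " (want yes)" }
        $1 == "fc_err_recov" && $2 != "fast_fail" { print $1 "=" $2 " (want fast_fail)" }
    '
}

# Only attempt the real check on AIX, where lsdev and lsattr behave as assumed.
if [ "$(uname -s)" = "AIX" ]; then
    for dev in $(lsdev -C -F name | grep '^fscsi'); do
        # Prefix each finding with the device name it belongs to.
        lsattr -El "$dev" | check_fscsi_attrs | sed "s/^/$dev: /"
    done
fi
```

Remember that adapters serving only tape drives or paging devices are expected to show delayed_fail, so a report for those is not necessarily a problem.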
So if you find you are not running with the ideal settings and it makes sense to change them, run these two commands against each relevant fscsi device. The -P flag updates only the ODM, so the new values do not take effect immediately; reboot at your leisure, unless you can unmount the affected file systems and vary off the affected VGs to apply the change sooner.
chdev -l fscsi0 -a dyntrk=yes -P
chdev -l fscsi0 -a fc_err_recov=fast_fail -P
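If you have more than a couple of adapters, a small helper that prints the commands for review before anything is run can save some typing. This is only a sketch: fscsi_chdev_cmds is a name I made up, and it does nothing except echo the same two chdev commands shown above for each device name you pass it.

```shell
#!/bin/sh
# fscsi_chdev_cmds (hypothetical helper): print, but do not run, the two
# chdev commands for every fscsi device named on the command line.
# -P records the change in the ODM only, as in the commands above.
fscsi_chdev_cmds() {
    for dev in "$@"; do
        case "$dev" in
            fscsi[0-9]*) ;;   # looks like an fscsi device, carry on
            *) echo "skipping $dev: not an fscsi device" >&2; continue ;;
        esac
        echo "chdev -l $dev -a dyntrk=yes -P"
        echo "chdev -l $dev -a fc_err_recov=fast_fail -P"
    done
}
```

For example, running fscsi_chdev_cmds fscsi0 fscsi1 prints four chdev lines that you can read through and then paste or pipe into a shell. Leave out any adapter whose only paths go to tape drives or paging devices, since those should stay on delayed_fail.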
Note we are also setting dynamic tracking to yes. This setting allows AIX to learn that the fibre channel port ID of a device has changed on the fly. This is handy if you need to move a cable to a different port or switch (where you are zoning by WWPN and you have a need to reconfigure on the fly).
The readme for AIX 5.2 (which applies equally to higher versions) explains all of this behaviour here: