It is ironic that only days after I wrote that 497 is the IT number of the beast, I learn that Linux has another unfortunate number: 208.
The reason is a defect in the internal Linux kernel used in recent firmware levels of SVC, Storwize V7000 and Storwize V7000 Unified nodes. This defect causes each node to reboot after 208 days of uptime. The issue exists in unfixed versions of the 6.2 and 6.3 firmware levels, so a large number of users will need to take action (except those who are still on a 4.x, 5.x, 6.0 or 6.1 release). If you have done a code update after June 2011, you are probably affected.
If you are an IBM client, you need to read this alert now and determine how far you are into that 208-day period. If you are an IBMer or an IBM Business Partner, you need to make sure your clients are aware of this issue, though hopefully they have signed up for IBM My Notifications and have already been notified by e-mail.
In short, you must:
- Determine your current firmware level.
- Check the table in the alert to determine whether you are affected at all and, if so, how far into the 208-day period you could potentially be.
- Use the Software Upgrade Test Utility to confirm your actual uptime.
- Before the 208-day period ends, either reboot your nodes (one at a time, with a decent interval between them) or install a fixed level of software (as detailed in the alert).
To give you an example of the process, my lab machine is on software version 18.104.22.168, which you can see in the screen capture below. When I check the table in the alert, I see that this version was made available on January 24, 2012, which means the 208-day period cannot possibly end before August 19, 2012.
SAN Volume Controller and Storwize V7000 Version 6.3
|Version Number|Release Date|Earliest possible date that a system running this release could hit the 208-day reboot|
|---|---|---|
|126.96.36.199|30 November 2011|25 June 2012|
|188.8.131.52|24 January 2012|19 August 2012|
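The dates in the last column are just the release date plus 208 days. Here is a minimal Python sketch of that arithmetic; the function name is made up for illustration, and it assumes the worst case, where a node has been up since the very day the release became available:

```python
from datetime import date, timedelta

# Uptime at which an affected node reboots, as stated in the alert.
REBOOT_AFTER_DAYS = 208


def earliest_possible_reboot(release_date: date) -> date:
    """Worst case: the node was booted on the day the release came out."""
    return release_date + timedelta(days=REBOOT_AFTER_DAYS)


# The two release dates from the table above.
print(earliest_possible_reboot(date(2011, 11, 30)))  # 2012-06-25
print(earliest_possible_reboot(date(2012, 1, 24)))   # 2012-08-19
```

Your real deadline is almost certainly later than this, because you probably did not install the code on its release day. That is why the actual uptime still needs to be checked.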
Regardless, I need to know the uptime of my nodes, so I download the Software Upgrade Test Utility (in case you have an older copy, we need at least version 7.9) and run it using the Upgrade Wizard (NOTE! We are NOT updating anything here, just checking):
I launch the Upgrade Wizard, use it to upload the tool, and follow the prompts to run it so that I can see the tool's output. The output in this example shows that the uptime of each node is 56 days, so I have a maximum of 152 days remaining before I have to take any action. At this point I select Cancel, so nothing is upgraded. You can run this tool as often as you like to keep checking uptime.
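If you want to turn the reported uptime into a date for your change calendar, the sum is simple enough to script. This is only a sketch: the function names are hypothetical, the uptime has to be typed in by hand from the tool output, and it assumes the node has not been rebooted since you checked.

```python
from datetime import date, timedelta

# Uptime at which an affected node reboots, as stated in the alert.
REBOOT_AFTER_DAYS = 208


def days_remaining(uptime_days: int) -> int:
    """Days left before the node reaches 208 days of uptime (never negative)."""
    return max(REBOOT_AFTER_DAYS - uptime_days, 0)


def act_before(uptime_days: int, checked_on: date) -> date:
    """Latest date by which the node should be rebooted or upgraded."""
    return checked_on + timedelta(days=days_remaining(uptime_days))


print(days_remaining(56))            # 152
print(act_before(56, date.today()))  # depends on the day you run it
```

Run it once per node, since nodes that were rebooted at different times will have different deadlines. A result of zero days remaining does not mean you are safe; it means you should go back to the alert and act on it.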
Note that if you are on 6.1 or 6.2 code you may see a timeout error when running the tool, especially the first time. If you do see an error, please follow the instructions in the section titled “When running the upgrade test utility v7.5 or later on Storwize V7000 v6.1 or v6.2” at the Test Utility download site.
As per the Alert:
- If you are running a 6.0 or 6.1 level of firmware, you are not affected.
- If you are running a 6.2 level of firmware, the fix level is v184.108.40.206 which is available here for Storwize V7000 and here for SVC.
- If you are running a 6.3 level of firmware, the fix level is v220.127.116.11 which is available here for Storwize V7000 and here for SVC.
- If you are using a Storwize V7000 Unified, the fix level is v18.104.22.168 which is available here.
If you have any questions or need help, please reach out to your IBM support team or leave me a comment or a tweet.
*** April 4: Updated the blog post with links for all fix levels ***
*** April 10: The IBM Web Alert has been updated with new information on what to do if your uptime has actually gone past 208 days without a reboot. In short, you still need to take action. Please read the updated alert and follow the instructions given there. ***