208 day reboot bug

It is ironic that only days after I wrote that 497 is the IT number of the beast, I learn that Linux has another unfortunate number: 208.

The reason for this is a defect in the internal Linux kernel used in recent firmware levels of SVC, Storwize V7000 and Storwize V7000 Unified nodes. This defect will cause each node to reboot after 208 days of uptime. The issue exists in unfixed versions of the 6.2 and 6.3 levels of firmware, so a large number of users are going to need to take some action on this (the exception being those who are still on a 4.x, 5.x, 6.0 or 6.1 release). If you have done a code update after June 2011, then you are probably affected. This means that if you are an IBM client you need to read this alert now and determine how far you are into that 208 day period. If you are an IBMer or an IBM Business Partner, you need to make sure your clients are aware of this issue, though hopefully they have signed up for IBM My Notifications and have already been notified by e-mail.

In short, you must:

Determine your current firmware level.
Check the table in the alert to determine if you are affected at all, and if so, how far you are potentially into the 208 day period.
Use the Software Upgrade Test Utility to confirm your actual uptime.
Prior to the 208 day period finishing, either reboot your nodes (one at a time, with a decent interval between them) or install a fixed level of software (as detailed in the alert).

To give you an example of the process, my lab machine is on software version 6.3.0.1 which you can see in the screen capture below. So when I check the table in the alert, I see that version 6.3.0.1 was made available on January 24, 2012, which means the 208 day period cannot possibly end before August 19, 2012.

Version Number	Release Date	Earliest possible date that a system running this release could hit the 208 day reboot

SAN Volume Controller and Storwize V7000 Version 6.3
6.3.0.0	30 November 2011	25 June 2012
6.3.0.1	24 January 2012	19 August 2012
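
The dates in the third column of the table above are simply the release date plus 208 days. Here is a minimal sketch of that arithmetic in Python, using the two release dates from the table. The function name earliest_reboot is my own invention for illustration and is not part of any IBM tool.

```python
from datetime import date, timedelta

# Uptime at which an unfixed node reboots, as described in the alert.
REBOOT_AFTER_DAYS = 208

def earliest_reboot(release: date) -> date:
    """Earliest date a node running this release could reach 208 days of uptime.

    Hypothetical helper: assumes the node was booted on the release date itself.
    """
    return release + timedelta(days=REBOOT_AFTER_DAYS)

print(earliest_reboot(date(2011, 11, 30)))  # 6.3.0.0 -> 2012-06-25
print(earliest_reboot(date(2012, 1, 24)))   # 6.3.0.1 -> 2012-08-19
```

This only gives the earliest possible date. The real deadline depends on when each node was last booted, which is why the actual uptime still needs to be checked.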

Regardless, I need to know the uptime of my nodes, so I download the Software Upgrade Test Utility (in case you have an older copy, we need at least version 7.9) and run it using the Upgrade Wizard (NOTE! We are NOT updating anything here, just checking):

I launch the Upgrade Wizard, use it to upload the tool and follow the prompts to run it, so that I get to see the output of that tool. The output in this example shows the uptime of each node is 56 days, so I have a maximum of 152 days remaining before I have to take any action. At this point I select Cancel. You can run this tool as often as you like to keep checking uptime.
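
If you want to turn the uptime reported by the tool into a calendar deadline, the sum is 208 minus the uptime in days. A small sketch in Python follows, assuming you type in the uptime by hand (the tool output is not parsed here). The function names and the example check date are my own, for illustration only.

```python
from datetime import date, timedelta

# Uptime at which an unfixed node reboots, as described in the alert.
REBOOT_AFTER_DAYS = 208

def days_remaining(uptime_days: int) -> int:
    """Days left before the node reaches 208 days of uptime.

    Returns 0 once the uptime is at or past 208 days; per the alert,
    action is still required in that case.
    """
    return max(REBOOT_AFTER_DAYS - uptime_days, 0)

def deadline(checked_on: date, uptime_days: int) -> date:
    """Latest date to reboot or upgrade, given the uptime seen on checked_on."""
    return checked_on + timedelta(days=days_remaining(uptime_days))

print(days_remaining(56))                # 152
print(deadline(date(2012, 3, 26), 56))   # 2012-08-25 (example check date)
```

Use the highest uptime across your nodes, since that node will hit the limit first.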

Note that if you are on 6.1 or 6.2 code you may see a timeout error when running the tool, especially for the first time. If you do see an error, please follow the instructions in the section titled “When running the upgrade test utility v7.5 or later on Storwize V7000 v6.1 or v6.2” at the Test Utility download site.

As per the Alert:

If you are running a 6.0 or 6.1 level of firmware, you are not affected.
If you are running a 6.2 level of firmware, the fix level is v6.2.0.5 which is available here for Storwize V7000 and here for SVC.
If you are running a 6.3 level of firmware, the fix level is v6.3.0.2 which is available here for Storwize V7000 and here for SVC.
If you are using a Storwize V7000 Unified, the fix level is v1.3.0.5 which is available here.

You should keep checking the alert to find out any new details as they come to hand. If you are curious about Linux and 208 day bugs, try this Google search.

If you have any questions or need help, please reach out to your IBM support team or leave me a comment or a tweet.

*** April 4: Updated the blog post with links for all fix levels ***

*** April 10: The IBM Web Alert has been updated with new information on what to do if your uptime has actually gone past 208 days without a reboot. In short you still need to take action. Please read the updated alert and follow the instructions given there. ***

About Anthony Vandewerdt

I am an IT Professional who lives and works in Melbourne Australia. This blog is totally my own work. It does not represent the views of any corporation. Constructive and useful comments are very very welcome.


8 Responses to 208 day reboot bug

Pingback: 208 day reboot bug « Storage CH Blog
Ronny Steiner says:

March 26, 2012 at 8:01 pm

Hi Anthony,

do you know if there is a way to check uptime with more details like hours and minutes? I would like to know if the last reboot between the canister nodes is within minutes or hours. Then I’d be able to determine if I hit a full or partial outage of the whole system.

- Anthony Vandewerdt says:
  
  March 26, 2012 at 8:45 pm
  
  Hi Ronny
  
  Great question.
  What you need to do is download a support package from the machine.
  Go to Settings –> Support and choose ‘download support package’
  The machine will create an SVC_SNAP which gets downloaded as a TGZ file.
  That file will contain the MCE log file from each node canister.
  Expand the TGZ file and look in DUMPS –> SYSLOG folder (if you have Windows, download WINRAR to do this).
  Look for the MCELOG log file for each node.
  In that file you can see lines like:
  
  mcelog.xxxxxxx-1
  Sat Jan 28 16:49:56 EST 2012 booting node
  
  mcelog.xxxxxxx-2
  Sat Jan 28 16:06:30 EST 2012 booting node
  
  So on my machine I have a 43 minute interval…
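
If you would rather not do the subtraction by eye, the two boot lines quoted above can be compared with a few lines of Python. This is only a sketch: it assumes the lines look exactly like the two shown above, that both canisters log in the same timezone, and that month names are in English. The function name parse_boot_line is made up for this example.

```python
from datetime import datetime

def parse_boot_line(line: str) -> datetime:
    """Parse a line like 'Sat Jan 28 16:49:56 EST 2012 booting node'.

    Hypothetical helper. The weekday and timezone tokens are dropped:
    both nodes are assumed to log in the same zone, and strptime's %Z
    only recognises a limited set of zone names.
    """
    _weekday, month, day, clock, _tz, year = line.split()[:6]
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")

node1 = parse_boot_line("Sat Jan 28 16:49:56 EST 2012 booting node")
node2 = parse_boot_line("Sat Jan 28 16:06:30 EST 2012 booting node")

# Time between the two canister boots.
gap = abs(node1 - node2)
print(gap)  # 0:43:26
```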
  
Ronny says:

March 26, 2012 at 9:29 pm

thanks Anthony, that’s really good to know!

perthitblog says:

March 27, 2012 at 10:28 am

Hi Anthony,

We are currently running V7000 with v6.2.0.3 and have an upgrade window on Thursday night. Our plan was to go to 6.3.0.1, but should I just go to v6.2.0.5 and wait for the fix for 6.3?

Garrick.

- Anthony Vandewerdt says:
  
  March 27, 2012 at 4:25 pm
  
  If you are NOT using VMware SRM, then going to V6.3 gives you the LDAP, GM change volumes, improved perf GUI, multi-session iSCSI, etc.
  But…. I would want you to go to a higher 6.3.0.x version within 208 days after upgrading.
  
  Going to 6.2.0.5 is a safer maintenance choice, if the whole point is to just do a maintenance style upgrade but you would eventually want to go to 6.3.0.x in a few months anyway.
  
  So ironically either choice is going to mean a code update now and a code update later. My preference? Go to 6.2.0.5 and then do 6.3.0.x in a few months time. FYI, The 6.3.0.x version with the fix should be out within the next 10 days.
  
Oliver Kilk says:

April 17, 2012 at 9:21 pm

Hey Anthony,

Thank you for the great blog.
I do have a general question about V7000. I’m running a V7000 Version 6.3.0.1 (build 54.6.1201270000). I also have a DS4500 and DS4700; can I use the V7000 as an SVC for these external storage systems? When I created the V7000 as a host on the external systems, as a port IBM TS SAN VCE, I kept getting: “Error Code 1624” “Event ID Text : Controller configuration has unsupported RDAC mode”.
Any ideas?

- Anthony Vandewerdt says:
  
  April 18, 2012 at 9:00 am
  
  Hi Oliver.
  
  The Storwize V7000 can absolutely act like an SVC in front of a DS4500 and DS4700.
  The unsupported RDAC mode sounds suspicious.
  I would check that your machines are on the right firmware levels.
  Check this table to confirm:
  
  http://www-01.ibm.com/support/docview.wss?uid=ssg1S1003908#_DS4K