497 – the number of the IT beast

I recently had a client ask me if I had seen a problem in Cisco Device Manager. Device Manager was showing them 100% CPU utilisation on one of their MDS9509s. I had a look at the show tech-support output and, curiously, show process cpu showed practically no CPU usage at all. I suggested a display problem and sure enough, Cisco confirmed it:

CSCte81951
Symptom: The show system resources command shows high CPU usage even when there is not much activity on the switch. In one instance, the CPU utility (user and kernel) was always 100 percent.
Conditions: You might see this symptom 248 days after the system came up

The Cisco tech support person stated that a CP switchover every 497 days would in fact prevent the issue recurring. This is curious because 248 days is almost exactly half of 497 days. And 497 is the IT number of the beast.

The reason that 497 is a problem number is the use of a 32-bit counter to record uptime. If you record a tick for every 10 msec of uptime, then a 32-bit counter will overflow after approximately 497.1 days. This is because a 32-bit counter can hold 2^32 = 4,294,967,296 ticks. Because a tick is counted every 10 msec, we create 8,640,000 ticks per day (100*60*60*24). So after 4,294,967,296 / 8,640,000 = 497.102696 days, the counter will overflow. A signed 32-bit counter runs out at half that, roughly 248.55 days, which would fit the 248 days in the Cisco bug note (although I have no confirmation that this is the actual cause). What happens next depends on good programming.
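To make the arithmetic concrete, here is a minimal Python sketch. It assumes the 10 msec tick described above; the function names are mine and do not come from any vendor's code. It also shows one common way that "good programming" copes with the overflow: working out elapsed ticks with modular subtraction, so that the answer is still right when the counter has wrapped once between two readings.

```python
TICK_MS = 10  # assumed tick length, as described in the text
TICKS_PER_DAY = 24 * 60 * 60 * 1000 // TICK_MS  # 8,640,000 ticks per day


def days_until_overflow(bits: int) -> float:
    """Days of uptime before a tick counter of the given width runs out."""
    return (2 ** bits) / TICKS_PER_DAY


def elapsed_ticks(start: int, now: int, bits: int = 32) -> int:
    """Ticks between two readings of an unsigned counter.

    The modular subtraction gives the right answer even if the counter
    wrapped once between the two readings.
    """
    return (now - start) % (2 ** bits)


print(round(days_until_overflow(32), 6))  # 497.102696 (unsigned 32-bit)
print(round(days_until_overflow(31), 6))  # 248.551348 (signed 32-bit)

# A reading taken 5 ticks before the wrap and another 10 ticks after it:
print(elapsed_ticks(2 ** 32 - 5, 10))  # 15
```

Code that simply compares the two raw readings (is now greater than start?) gets the wrong answer at the wrap, which is the sort of mistake that tends to sit behind a 497 day bug.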

Some classic bugs can be found here, here, here and here.  Most of these bugs are old and will almost certainly not affect anybody.  But remain on notice:  497 day bugs are still possible.   Just Google the search argument: 497.1 day bug.  

Now let me be clear: I am not aware of any active, disruptive, bring-down-your-business type 497 day bugs. The sky is not falling. But historically many vendors' products have had 497 day bugs, some of them nasty. I ponder whether we should schedule a switch reboot every 496 days just to avoid the possibility of a 497 day bug. It's an interesting idea. I certainly endorse staggering initial switch reboots by at least an hour, so that a simultaneous 497 day reboot bug (should one be lurking) would not reboot every switch in every fabric at the same time. And in case you think I am picking on Cisco, when I looked at the client switch in question, it was showing a kernel uptime of 562 days, 23 hours, 35 minutes, 24 seconds. That's some solid uptime.
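If you did want to plan around the rollover, a rough Python sketch like the one below could estimate how long a device has until its next one. It assumes an unsigned 32-bit counter that started at zero at boot and ticks every 10 msec; real devices may well count differently, and the function name is purely illustrative.

```python
from datetime import timedelta

TICK = timedelta(milliseconds=10)  # assumed tick length
WRAP = TICK * 2 ** 32              # one full counter cycle, about 497.1 days


def time_to_next_wrap(uptime: timedelta) -> timedelta:
    """Time remaining until an unsigned 32-bit tick counter next wraps."""
    return WRAP - (uptime % WRAP)


# The client switch mentioned above, which is already past its first wrap:
uptime = timedelta(days=562, hours=23, minutes=35, seconds=24)
print(time_to_next_wrap(uptime))
```

The same function could be run against each switch's uptime to check that their rollovers do not all land in the same hour.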

About Anthony Vandewerdt

I am an IT Professional who lives and works in Melbourne Australia. This blog is totally my own work. It does not represent the views of any corporation. Constructive and useful comments are very very welcome.