How can I test Storwize V7000 Node Canister failure?

I have received this question several times, so it’s clearly something people are interested in.

The Storwize V7000 has two controllers, known as node canisters. It is an active/active storage controller, in that both node canisters are processing I/O at all times and any volume can be happily accessed via either node canister.

The question then gets asked: what happens if a node canister fails, and can I test this? The answer to the failure question is that the second node canister will handle all the I/O on its own.  Your host multipathing driver will switch to the remaining paths and life will go on.  We know this works because a firmware upgrade takes one node canister offline at a time, so if you have already done a firmware update, then you have already tested node canister failover.  But what if you want to test this discretely? There are four ways:

  1. Walk up to the machine and physically pull out a node canister. This is a bit extreme and is NOT recommended.
  2. Power off a node canister using the CLI (the satask stopnode command; there is a rough sketch of the syntax after this list).  This will work for the purposes of testing node failure, but the only way to power the node canister back on is to pull it out and reinsert it.  This is again a bit extreme and is not recommended.  This is also different from an SVC, since each SVC node has its own power on/off button.
  3. Use the CLI to remove one node from the I/O group (the svctask rmnode command, also sketched after this list). This works on an SVC because the nodes are physically separate.  On a Storwize V7000 the nodes live in the same enclosure and a candidate node will immediately be added back to the cluster, so as a test this is not that helpful.
  4. Place one node into service state and leave it there while you check all your hosts. This is my recommended method.
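
For reference, here is roughly what methods 2 and 3 look like from the CLI. Treat this as a sketch only: check the exact syntax against your firmware level, and substitute your own values for the placeholder panel name and node ID.

    # Method 2: power off a node canister via the service CLI
    # (the only way to power it back on is to physically reseat it)
    satask stopnode -poweroff <panel_name>

    # Method 3: remove a node from the I/O group
    # (on a Storwize V7000 the candidate node rejoins almost immediately)
    svctask rmnode <node_id_or_name>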

First up, this test assumes there is NOTHING else wrong with your Storwize V7000.  We are not testing multiple failures here.   You need to confirm that the Recommended Actions panel, as shown below, contains no items.   If there are errors listed, fix them first.
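
If you prefer to check from the CLI, listing unfixed events should show the same thing as the Recommended Actions panel. A sketch, assuming your code level supports the -fixed filter on lseventlog:

    # an empty list of unfixed events means you are clear to test
    svcinfo lseventlog -fixed no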

Once we are certain our Storwize V7000 is clean and ready for test, we need to connect via the Service Assistant Web GUI.  If you have not set up access to the service assistant, please read this blog post first.

So what’s the process?

First, log on to the service assistant on node 1 and place node 2 into service state.   I chose node 2 because normally node 1 is the configuration node (the node that owns the cluster IP address).   You need to confirm you are connected to node 1 (check at top right), select node 2 (from the Change Node menu), then choose Enter Service State from the drop-down and hit GO.
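
You can also drive this step from the service CLI rather than the GUI. A sketch, assuming your code level supports the satask startservice command, with a placeholder standing in for node 2's real panel name:

    # run against node 1's service IP: place node 2 into service state
    satask startservice <node2_panel_name>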

You will get this message confirming that you are placing node 2 into service state. If it looks correct, select OK.

The GUI will pause on this screen for a short period.  Wait for the OK button to un-grey.

You will eventually get to this screen, with Node 1 Active and Node 2 in Service.

Node 2 is now offline. Go and confirm that everything is working as desired on your hosts (half your paths will be offline, but your hosts should still be able to access the Storwize V7000 via the other node canister).
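
How you confirm this depends on your multipathing driver. On a Linux host using device-mapper multipath, for example, something along these lines will show the paths to the service-state node as failed while I/O carries on over the surviving paths:

    # list multipath devices and per-path state; expect roughly half
    # the paths to show failed/faulty while node 2 is in service state
    multipath -ll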

When your host checking is complete, you can use the same drop-down to Exit Service State on node 2 and select GO.
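
Again, the CLI equivalent is a one-liner. A sketch, using the same placeholder panel name as before:

    # bring node 2 back out of service state
    satask stopservice <node2_panel_name>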

You will get a pop-up window to confirm your selection.   If the window looks correct, select OK.

You will get the following panel.   You will need to wait for the OK button to become available (to un-grey).

Provided both nodes now show as Active, your test is complete.
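
If you want a second opinion from the CLI, sainfo lsservicenodes should list both node canisters as Active (the output columns vary a little by code level):

    # both canisters should report a status of Active
    sainfo lsservicenodes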


About Anthony Vandewerdt

I am an IT Professional who lives and works in Melbourne Australia. This blog is totally my own work. It does not represent the views of any corporation. Constructive and useful comments are very very welcome.

13 Responses to How can I test Storwize V7000 Node Canister failure?


  1. sunwoo kim says:

    I want to know whether the V7000 write cache function will be disabled when one canister goes down.
    Disabling the cache will make performance slow.
    Please reply to me.

  2. Willy Schriemer says:

    Hi Anthony, as an Oracle DBA I’m a newbie in the IBM V7000 storage community. Currently I’m creating and testing database “backup/restore” scripts using the IBM FlashCopy technology.

    The issue I have is that after a canister failure I have to manually change ALL the FlashCopy Manager profiles to point them at the second canister. Isn’t there a more elegant mechanism on a Storwize box, such as an ALIAS or a PRIMARY/SECONDARY arrangement, so that I won’t have to change all the FlashCopy profiles on the local Oracle instances if a canister fails?

    TIA,

    Willy Schriemer.

    • Hi Willy, this behaviour is a new one on me, but I do not have much experience with Tivoli FlashCopy Manager. Is that what you are having issues with?

      • Willy Schriemer says:

        Hello Anthony,

        During the configuration of FCM you have to enter the IP address or name of the canister that FCM connects to. So if that canister fails over to the second one, you need to adjust the FCM configuration with the new canister name.

        As far as I know there isn’t any intelligence in FCM such as an ALIAS or SECONDARY canister. The way I have solved it for now is to ping the canisters in the FCM shell script and use the correct FCM configuration file depending on the outcome of the ping tests.

        TIA,

        Willy.
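
        A minimal sketch of the ping-based selection Willy describes, with hypothetical service IP addresses and FCM profile paths standing in for the real ones:

            # hypothetical canister service IPs and FCM profile paths
            if ping -c 1 10.1.1.11 >/dev/null 2>&1; then
                FCM_PROFILE=/etc/fcm/profile_node1
            else
                FCM_PROFILE=/etc/fcm/profile_node2
            fi
            echo "Selected profile: $FCM_PROFILE"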

  3. Evelyn says:

    Why don’t you use the Cluster (also called System) IP address? The Cluster IP address is a floating system wide IP that follows the config node around. On Node failover, the Cluster IP will move to the next node in a prioritised list.

    It is only the service IP that is bound to the particular node hardware.

  4. KG says:

    You can also shut the canister’s ports from the SAN switch.

  5. Nacima says:

    Hello,

    Please, I have a problem on a V3700 where one node is in a Starting state with error 550 and the second is in Service state with error code 467 (or 764, I do not recall exactly).

    How can I resolve my problem?

    Thank you for your help.
    Best regards.

  6. Akm says:

    Please, I have an issue like this: controller no. 1 failed and did not respond; it did not even appear in the service tool page, so I restarted it. When it finished restarting it appeared, but without a node name, only the storage serial number, and it sits in service state with error codes 509, 562 and 578.
    So can I take it out of service state to join it to the cluster again, and will this avoid affecting the second node, which is active now? And do these codes point to a hardware failure that requires changing the controller?
