Monitoring IBM Storwize and IBM SVC products with Splunk

I have been playing around with Splunk recently so I can understand what it is and why my customers may choose to use it. For those that don’t know, Splunk (the product) captures, indexes and correlates real-time data in a searchable repository, from which it can generate graphs, reports, alerts, dashboards and visualizations. In essence, Splunk is a really cool and smart way to look at and analyse your data.

Because Splunk can ingest data from almost any source, we can quite easily start pulling data out of an IBM Storwize or SVC product and then investigate it with Splunk. I couldn’t find anything on Google about this, so here is a post that will help you along.

A common way to get data into Splunk is to use syslog.   Since Storwize can send events to syslog, all we need to do on the Storwize side is configure where the Splunk server is.

In this example I have chosen syslog level 7 (which gives detailed output) and chosen to send all events.

[Screenshot: syslog server settings on the Storwize]

Then, on the Splunk side, make sure Splunk is listening for syslog events. Storwize always sends syslog on UDP port 514:

[Screenshot: Splunk UDP data input on port 514]
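If you want to confirm that Splunk is really listening before you point the Storwize at it, you can fire a test message at port 514 from another machine and then search for it in Splunk. Below is a minimal Python 3 sketch of such a sender. It is not part of the Storwize or Splunk setup: the tag `storwize-test`, the user-level facility and the notice severity are arbitrary choices for this sketch, and the message is a bare RFC 3164-style line with no timestamp or hostname.

```python
import socket

# RFC 5424 numeric codes: facility 1 = user-level, severity 5 = notice.
# The syslog PRI value is facility * 8 + severity (here 13).
USER_FACILITY = 1
NOTICE_SEVERITY = 5


def send_test_syslog(host: str, port: int = 514, message: str = "hello from a test") -> bytes:
    """Send one minimal syslog datagram over UDP and return the bytes sent.

    The payload is "<PRI>tag: message" with no timestamp or hostname; a UDP
    input such as Splunk's indexes the raw text it receives.
    """
    pri = USER_FACILITY * 8 + NOTICE_SEVERITY
    payload = f"<{pri}>storwize-test: {message}".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Call `send_test_syslog()` with your Splunk server's address, then search Splunk for `storwize-test`. UDP gives no delivery confirmation, so the search is the only real test.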

However, this only captures events. There are lots of other pieces of information we may want to pull out of our Storwize products and graph in Splunk, so let's teach Splunk how to get them using the CLI over SSH.

First, we need to give Splunk a user ID so it can log in to our Storwize and grab data. I created a new user called Splunk on my Storwize V3700 and placed it in the Monitor group, so anyone with the Splunk user ID and password can look but not touch. I then supplied a public SSH key, since I don’t want to store a password in any text file and SSH keys make things nice and easy. In this case I am using the id_rsa.pub file of the root user on my Splunk server, because in my setup Splunk runs all scripts as root.

[Screenshot: creating the Splunk user in the Monitor group with an SSH public key]

Now, from a root command prompt on the Splunk server (called av-linux), I test that I can reach my V3700 (IP address 172.24.1.121) using the lsmdiskgrp command. It’s all looking good.

[root@av-linux ~]# ssh splunk@172.24.1.121 "lsmdiskgrp -delim ,"
id,name,status,mdisk_count,vdisk_count,capacity,extent_size,free_capacity,virtual_capacity,used_capacity,real_capacity,overallocation,warning,easy_tier,easy_tier_status,compression_active,compression_virtual_capacity,compression_compressed_capacity,compression_uncompressed_capacity 
0,InternalPool1,online,1,5,32.55TB,2048,27.06TB,5.49TB,5.49TB,5.49TB,16,80,auto,balanced,no,0.00MB,0.00MB,0.00MB

So I am now set up to write scripts that Splunk can fire on a regular basis to pull data from my Storwize device using SSH CLI commands.
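If you would rather have Splunk run a Python script than a shell one, a minimal sketch of a scripted-input wrapper might look like this. The `-o BatchMode=yes` option is a standard OpenSSH setting that makes ssh fail instead of prompting for a password, which matters because nobody is around to type one when Splunk runs the script. The user and IP address are the ones used above; the function names are hypothetical.

```python
import subprocess
import sys

# Values from the example above; substitute your own monitor user and cluster IP.
STORWIZE_USER = "splunk"
STORWIZE_HOST = "172.24.1.121"


def build_ssh_command(user: str, host: str, cli: str) -> list[str]:
    """Build the argv for running one Storwize CLI command over SSH.

    BatchMode=yes makes ssh fail rather than prompt for a password, so an
    unattended Splunk run errors out instead of hanging.
    """
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", cli]


def run_cli(cli: str, timeout: int = 60) -> str:
    """Run the CLI command over SSH and return stdout; raises if ssh fails."""
    result = subprocess.run(
        build_ssh_command(STORWIZE_USER, STORWIZE_HOST, cli),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout


def main() -> None:
    # Splunk indexes whatever a scripted input writes to stdout.
    sys.stdout.write(run_cli("lsmdiskgrp -delim , -bytes"))
```

To use it as a scripted input, add a call to `main()` at the bottom of the file and make it executable, just like the shell script.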

Now here are two important things to realize about using SSH commands to pull data from Storwize and ingest it into Splunk:

  1. For historical data like logs, it is very easy to pull the same data twice. For instance, if I grab the output of the lseventlog command using an SSH script, I will get every event in the log, which is fine. But if I grab it again the next day, most of those events will be ingested a second time. If I want to check how often a particular event occurs, I will count the same event several times because I ingested it several times. Ideally the Storwize CLI commands would let me filter on dates, but that functionality is not available.
  2. Real-time display commands don’t insert a date into the output, but Splunk records the date and time at which each piece of data was collected.
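One way around the first problem is to remember the highest event sequence number already sent to Splunk and only output newer events on the next run. The sketch below assumes that `lseventlog -delim ,` output includes a `sequence_number` column (check this against your code level) and keeps its state in a hypothetical file under /opt/splunk/var. It does not handle the sequence counter ever resetting.

```python
import csv
from pathlib import Path

# Hypothetical location for the state file; any path the Splunk user can write works.
STATE_FILE = Path("/opt/splunk/var/storwize_eventlog.seq")


def load_last_seen(path: Path = STATE_FILE) -> int:
    """Return the last sequence number already sent to Splunk, or -1 if none."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return -1


def save_last_seen(seq: int, path: Path = STATE_FILE) -> None:
    """Persist the highest sequence number sent so far."""
    path.write_text(f"{seq}\n")


def new_events(csv_text: str, last_seen: int) -> tuple[list[dict[str, str]], int]:
    """Return rows with sequence_number > last_seen, plus the new high-water mark.

    Assumes CSV output (lseventlog -delim ,) with a sequence_number header.
    Lines are stripped because the CLI can leave a trailing space on the header.
    """
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    fresh: list[dict[str, str]] = []
    highest = last_seen
    for row in csv.DictReader(lines):
        seq = int(row["sequence_number"])
        if seq > last_seen:
            fresh.append(row)
            highest = max(highest, seq)
    return fresh, highest
```

Each run would load the state, fetch the event log over SSH, print only the fresh rows, and save the new high-water mark, so every event is ingested once.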

Let’s take the output of lsmdiskgrp shown above. If we run this once per day, we can track the space consumption of each pool over time. Sounds good, right? So on my Splunk server I create a script like this. Notice that I request the output in bytes; this is important because the default output could be in MB, GB or TB.

#!/bin/sh
ssh splunk@172.24.1.121 "lsmdiskgrp -delim , -bytes"

I put the script into the /opt/splunk/bin/scripts folder and call it v37001pools.

I make it executable and give it a test run:

[root@av-linux scripts]# pwd
/opt/splunk/bin/scripts
[root@av-linux scripts]# chmod 755 v37001pools
[root@av-linux scripts]# ./v37001pools
id,name,status,mdisk_count,vdisk_count,capacity,extent_size,free_capacity,virtual_capacity,used_capacity,real_capacity,overallocation,warning,easy_tier,easy_tier_status,compression_active,compression_virtual_capacity,compression_compressed_capacity,compression_uncompressed_capacity 
0,InternalPool1,online,1,5,35787814993920,2048,29753385943040,6034429050880,6034429050880,6034429050880,16,80,auto,balanced,no,0,0,0
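Because the -bytes output is awkward to read, here is a small sketch that turns it back into TB and a percentage used. In the sample above, 35787814993920 bytes is displayed as 32.55TB, which matches a binary (1024^4) conversion, so the sketch divides by 1024**4. The function names are hypothetical.

```python
import csv

# Storwize's "TB" figures match binary terabytes: 35787814993920 B shows as 32.55TB.
TIB = 1024 ** 4


def parse_pools(csv_text: str) -> list[dict[str, str]]:
    """Parse `lsmdiskgrp -delim , -bytes` output into one dict per pool.

    Lines are stripped because the header row ends with a trailing space.
    """
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    return list(csv.DictReader(lines))


def pool_summary(pool: dict[str, str]) -> dict[str, float]:
    """Return capacity and free space in TB (binary) plus the percentage used."""
    capacity = int(pool["capacity"])
    free = int(pool["free_capacity"])
    return {
        "capacity_tb": round(capacity / TIB, 2),
        "free_tb": round(free / TIB, 2),
        "used_pct": round(100 * (capacity - free) / capacity, 1),
    }
```

Run against the output above, this gives 32.55 TB capacity, 27.06 TB free and 16.9% used, matching the human-readable lsmdiskgrp output shown earlier.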

So now I tell Splunk I have a new input using a script:

[Screenshot: adding a new scripted data input in Splunk]

Enter the location of the script, the interval, and the fact that this is CSV (because we are using -delim with a comma). Note that my interval is crazy: every 60 seconds is way too often, and even every 3600 seconds is probably too often. I only used it to get lots of samples quickly.

[Screenshot: script location, interval and CSV source type]

I now confirm I have new data I can search:

[Screenshot: search results showing the new data]

And the data itself is timestamped, with every field identified, including the pool names.

[Screenshot: timestamped event with extracted fields]

Now I can start graphing this data. With Splunk, I find that having someone publish the dashboard XML makes life way easier. So I created an empty dashboard called Storwize Pools and then immediately selected Edit Source.

[Screenshot: Edit Source on the new dashboard]

Now replace the default source (delete any text already there) with the dashboard XML, changing the heading and script name to your own (marked in red) and the pool name to one of your pools (marked in blue). If you have more than one pool, add a chart for every pool: copy the whole chart section and make a new chart from it.

You will find the required XML in the attached Word document. WordPress kept fighting me and changing my quotes, so I have attached the XML as a doc.

[Attachment: SplunkDashboard]
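If you have several pools, generating the XML can be less tedious than copying chart sections by hand. This is not the XML from the attachment; it is a minimal sketch that builds a Simple XML dashboard with one line chart per pool. It assumes the `<search><query>` element names used by later Splunk releases (older releases used different names), and the `source` value passed in is a placeholder: check the source field on your own events for the real one.

```python
import xml.etree.ElementTree as ET


def build_dashboard(label: str, source: str, pools: list[str]) -> str:
    """Build a Simple XML dashboard with one line chart per storage pool.

    `source` is whatever Splunk recorded as the source of the scripted input.
    ElementTree escapes any XML-special characters in the query text.
    """
    dashboard = ET.Element("dashboard")
    ET.SubElement(dashboard, "label").text = label
    for pool in pools:
        row = ET.SubElement(dashboard, "row")
        panel = ET.SubElement(row, "panel")
        chart = ET.SubElement(panel, "chart")
        ET.SubElement(chart, "title").text = pool
        search = ET.SubElement(chart, "search")
        # Plot capacity and used capacity (bytes) for this pool over time.
        ET.SubElement(search, "query").text = (
            f'source="{source}" name="{pool}" '
            "| timechart max(capacity) AS capacity max(used_capacity) AS used"
        )
        ET.SubElement(search, "earliest").text = "-7d"
        ET.SubElement(search, "latest").text = "now"
        ET.SubElement(chart, "option", name="charting.chart").text = "line"
    return ET.tostring(dashboard, encoding="unicode")
```

Paste the returned string into Edit Source; adding a pool is then just one more name in the list.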


And we get a lovely dashboard that looks like this. Because the script runs every 60 seconds, I am getting 60-second stats.

[Screenshot: the Storwize Pools dashboard]

We could run it every day (an interval of 86400 seconds) or, which makes more sense, use a cron schedule so it runs at the same time each day. For example, to run it once per day at 1am, set the interval to a cron value like this: 0 01 * * *

[Screenshot: scripted input interval set to a cron value]
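The cron value is the usual five fields (minute, hour, day of month, month, day of week), so 0 01 * * * means 01:00 every day. The sketch below only handles that daily `M H * * *` form and shows when such a schedule would next fire; it ignores time zones and daylight saving, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def next_daily_run(cron: str, now: datetime) -> datetime:
    """Return the next time a daily 'M H * * *' cron expression fires after `now`.

    Only the daily form is handled; any other pattern raises ValueError.
    """
    fields = cron.split()
    if len(fields) != 5 or fields[2:] != ["*", "*", "*"]:
        raise ValueError(f"only 'M H * * *' is supported, got {cron!r}")
    minute, hour = int(fields[0]), int(fields[1])  # "01" parses as 1
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

For example, at 12:49 on 29 October 2014 the next run of 0 01 * * * is 01:00 on 30 October 2014.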

So hopefully that will help you get started with monitoring your SVC or Storwize product with Splunk.

If you would like some more examples, just leave a comment!


About Anthony Vandewerdt

I am an IT Professional who lives and works in Melbourne Australia. This blog is totally my own work. It does not represent the views of any corporation. Constructive and useful comments are very very welcome.
This entry was posted in Uncategorized.

3 Responses to Monitoring IBM Storwize and IBM SVC products with Splunk

  1. Gary Diggs says:

    Hi Anthony,
    Great article!!!
    Do you have anything with syslog and XIV?

  2. Nice! Any plans to make a splunk app out of this?

  3. Shubham Soin says:

    Hi Anthony, Thanks for the article. Could you please suggest some more examples for detailed monitoring of the SAN boxes. That would be of great help!
