Monitoring IBM Storwize and IBM SVC products with Splunk

I have been playing around with Splunk recently, so I can understand what it is and why my customers may choose to use it.  For those that don't know, Splunk (the product) captures, indexes and correlates real-time data in a searchable repository, from which it can generate graphs, reports, alerts, dashboards and visualizations.  In essence, Splunk is a really cool and smart way to look at and analyse your data.

Because Splunk is able to ingest data from almost any source, we can quite easily start pulling data out of an IBM Storwize or SVC product and then investigate it with Splunk.  I couldn't find anything on Google about this subject, so here is a post that will help you along.

A common way to get data into Splunk is to use syslog.   Since Storwize can send events to syslog, all we need to do on the Storwize side is configure where the Splunk server is.

In this example I have chosen syslog level 7 (which is detailed output) and to send all events.

[Screenshot: Storwize syslog server configuration]
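
The same thing can also be done from the Storwize CLI rather than the GUI.  This is a hedged sketch from memory rather than something from the original setup: I am assuming the mksyslogserver command and its -facility/-error/-warning/-info flags behave this way on your code level (check the CLI reference for your release), that the -facility value corresponds to the level chosen in the GUI, and 172.24.1.200 is simply a stand-in for the Splunk server's address.

# hypothetical example - verify the flag names in the CLI reference for your code level
svctask mksyslogserver -name splunk -ip 172.24.1.200 -facility 7 -error on -warning on -info on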

Then, on the Splunk side, ensure Splunk is listening for syslog events.  Storwize always uses UDP port 514:

[Screenshot: Splunk data input listening on UDP port 514]
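
If you manage inputs with configuration files rather than the web interface, the equivalent is a small stanza in inputs.conf on the Splunk server.  A minimal sketch, assuming the default syslog source type is good enough for your needs:

[udp://514]
# treat anything arriving on UDP 514 as syslog data
sourcetype = syslog
# record the sending device's IP address as the host field
connection_host = ip

Setting connection_host = ip means each event is tagged with the address of the Storwize that sent it, which makes it easy to tell multiple systems apart later.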

However, this really only captures events.  There are lots of other pieces of information we may want to pull out of our Storwize products and graph in Splunk.  So let's teach Splunk how to get them using the CLI over SSH.

Firstly we need to supply Splunk with a user ID so it can log in to our Storwize and grab data.  I created a new user on my Storwize V3700 called Splunk and placed it in the Monitor group (so anyone with the Splunk user ID and password can look but not touch), then supplied a public SSH key, since I don't want to store a password in any text file and using SSH keys makes things nice and easy.  In this case I am using the id_rsa.pub file for the root user of my Splunk server, since in my case Splunk is running all scripts as root.

[Screenshot: creating the Splunk user on the Storwize V3700 with an SSH public key]
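
If the root user on the Splunk server doesn't already have a key pair, generating one and grabbing the public half only takes a minute.  A minimal sketch using the OpenSSH defaults (nothing Storwize-specific here):

# on the Splunk server, as the user that will run the scripts (root in my case)
ssh-keygen -t rsa
# paste the contents of this file into the SSH public key field of the new Storwize user
cat /root/.ssh/id_rsa.pub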

Now, from my root command prompt on the Splunk server (called av-linux), I test that access to my V3700 (on IP address 172.24.1.121) works, using the lsmdiskgrp command.  It's all looking good.

[root@av-linux ~]# ssh splunk@172.24.1.121 "lsmdiskgrp -delim ,"
id,name,status,mdisk_count,vdisk_count,capacity,extent_size,free_capacity,virtual_capacity,used_capacity,real_capacity,overallocation,warning,easy_tier,easy_tier_status,compression_active,compression_virtual_capacity,compression_compressed_capacity,compression_uncompressed_capacity 
0,InternalPool1,online,1,5,32.55TB,2048,27.06TB,5.49TB,5.49TB,5.49TB,16,80,auto,balanced,no,0.00MB,0.00MB,0.00MB

So I am now set up to write scripts that Splunk can fire on a regular basis to pull data from my Storwize device using SSH CLI commands.

Now, here are two important things to realize about using SSH commands to pull data from Storwize and ingest it into Splunk:

  1. For historical data like logs, it is very easy to pull the same data twice.  For instance, if I grab the output of the lseventlog command using an SSH script, I will get every event in the log, which is fine.  But if I grab it again the next day, most of the same events will be ingested again.  If I am trying to work out how often a particular event occurs, I will count the same event many times because I ingested it many times.  Ideally the Storwize CLI commands would let me filter on dates, but that functionality is not available, so you need to de-duplicate at search time (see the sketch after this list).
  2. Real-time display commands don't insert a date into their output, but Splunk will record the date and time at which each piece of data was collected.
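
To illustrate point 1, here is a hedged sketch of de-duplicating at search time.  It assumes a hypothetical scripted input called v3700events that ingests lseventlog output, and relies on the sequence_number and error_code fields from the lseventlog header row; adjust the names to whatever your script and fields are actually called:

source="/opt/splunk/bin/scripts/v3700events" | dedup sequence_number | stats count by error_code

The dedup command keeps only the first copy of each sequence number, so pulling the same log every day doesn't inflate the counts.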

Let's take the output of lsmdiskgrp as shown above.  If we run this once per day we can track the space consumption of each pool over time.  Sounds good, right?  So on my Splunk server I create a script like this.  Notice I get the output in bytes; this is important because the default output could be in MB, GB or TB.

ssh splunk@172.24.1.121 "lsmdiskgrp -delim , -bytes"

I put the script into the /opt/splunk/bin/scripts folder and call it v37001pools.
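
For completeness, the whole script is just the one-liner above plus an interpreter line; a minimal sketch of the file contents (the user and address are exactly the ones used earlier):

#!/bin/bash
# v37001pools - pull pool capacity figures from the V3700 as comma-separated values, in bytes
ssh splunk@172.24.1.121 "lsmdiskgrp -delim , -bytes"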

I make it executable and give it a test run:

[root@av-linux scripts]# pwd
/opt/splunk/bin/scripts
[root@av-linux scripts]# chmod 755 v37001pools
[root@av-linux scripts]# ./v37001pools
id,name,status,mdisk_count,vdisk_count,capacity,extent_size,free_capacity,virtual_capacity,used_capacity,real_capacity,overallocation,warning,easy_tier,easy_tier_status,compression_active,compression_virtual_capacity,compression_compressed_capacity,compression_uncompressed_capacity 
0,InternalPool1,online,1,5,35787814993920,2048,29753385943040,6034429050880,6034429050880,6034429050880,16,80,auto,balanced,no,0,0,0

So now I tell Splunk I have a new input using a script:

[Screenshot: adding a new scripted data input in Splunk]

Input the location of the script, the interval, and the fact that this is CSV data (because we are using -delim with a comma).  Note my interval is crazy: every 60 seconds is way too often, and even every 3600 seconds is probably too often.  I only used it to get lots of samples quickly.

[Screenshot: scripted input settings showing the script path, interval and source type]
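
For reference, the same scripted input can be defined directly in inputs.conf.  A minimal sketch, assuming the script path above and the built-in csv source type; the index name is my assumption (use whichever index you want the data to land in):

[script:///opt/splunk/bin/scripts/v37001pools]
# run hourly - still more often than most capacity data needs
interval = 3600
# the script emits comma-separated values with a header row
sourcetype = csv
# assumption: send the data to the default index
index = main
disabled = 0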

I now confirm I have new data I can search:

[Screenshot: Splunk search showing the new scripted input data]

And the data itself is time stamped, with all fields identified, including details like the pool names.
[Screenshot: an indexed event with the extracted pool fields]
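
As a quick sanity check, a search like the one below pulls a few of those fields back out; the source value matches the script path and the field names come straight from the lsmdiskgrp header row:

source="/opt/splunk/bin/scripts/v37001pools" | table _time, name, capacity, used_capacity, free_capacity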


Now I can start graphing this data.  With Splunk, what I find is that life is much easier if someone publishes the dashboard XML.  So I created an empty dashboard called Storwize Pools and then immediately selected Edit Source.

[Screenshot: editing the source of the empty Storwize Pools dashboard]

Now replace the default source (delete any text already in the source) with the XML below, changing the heading and script name to your own, and the pool name to the name of one of your pools.  If you have more than one pool, add an additional chart for every pool (copy the whole chart section and just make a new chart).



<dashboard>
<label>Storwize Pool Stats</label>
<description/>
<row>
<panel>
<chart>
<title>InternalPool1</title>
<searchString>source="/opt/splunk/bin/scripts/v37001pools" InternalPool1 | eval usedgb=((used_capacity/1024)/1024/1024) | eval capgb=((capacity/1024)/1024/1024) | timechart max(usedgb) as "UsedSpace(GB)",max(capgb) as "MaxSpace(GB)"</searchString>
<earliestTime>0</earliestTime>
<latestTime/>
<option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
<option name="charting.axisLabelsX.majorLabelStyle.rotation">0</option>
<option name="charting.axisTitleX.visibility">visible</option>
<option name="charting.axisTitleY.visibility">visible</option>
<option name="charting.axisTitleY2.visibility">visible</option>
<option name="charting.axisX.scale">linear</option>
<option name="charting.axisY.scale">linear</option>
<option name="charting.axisY2.enabled">false</option>
<option name="charting.axisY2.scale">inherit</option>
<option name="charting.chart">line</option>
<option name="charting.chart.nullValueMode">connect</option>
<option name="charting.chart.sliceCollapsingThreshold">0.01</option>
<option name="charting.chart.stackMode">default</option>
<option name="charting.chart.style">shiny</option>
<option name="charting.drilldown">none</option>
<option name="charting.layout.splitSeries">0</option>
<option name="charting.legend.labelStyle.overflowMode">ellipsisEnd</option>
<option name="charting.legend.placement">right</option>
<option name="charting.axisTitleX.text">Date</option>
<option name="charting.axisTitleY.text">CapacityGB</option>
</chart>
</panel>
</row>
</dashboard>



And we get a lovely dashboard that looks like this.  Because the script runs every 60 seconds, I am getting 60-second stats.

[Screenshot: the Storwize Pool Stats dashboard showing used and maximum pool capacity over time]

We could run it every day, or use a cron-style interval to run it at the same time every day (which makes more sense).  Maybe once per day at 1 am, by setting the interval to a cron value like this:  0 01 * * *

[Screenshot: scripted input interval set to a cron value]
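
In inputs.conf terms, that is just a change to the interval line of the stanza shown earlier; a minimal sketch:

[script:///opt/splunk/bin/scripts/v37001pools]
# run once per day at 1 am instead of every hour
interval = 0 01 * * *
sourcetype = csv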

So hopefully that will help you get started with monitoring your SVC or Storwize product with Splunk.

If you would like some more examples, just leave a comment!


The quest for perfect knowledge doesn’t start with a screen capture

One of the fun parts of my job is problem solving…  I won't lie… I love it.

Step one in problem solving is always the same:  define the problem.
Step two:  get the data needed to solve the problem.
Step three:  solve it!

Simple, right?

Wrong.

One of the reasons IT gets it wrong again and again is simple: the assumption of perfect knowledge.  We assume that with one sentence, or even worse, one screen capture, we have described the problem with enough depth that it can now be solved.  That the team now perfectly understands the problem and that the solution they supply will be… wait for it… you guessed it… perfect!


Don't get me wrong, I love screen captures (using my favourite tool, Snagit).  In fact, screen captures are one of my number one tools for writing documentation.  When I worked on IBM Redbooks (one of IBM's greatest free gifts to the IT community) I often found some chapters were more picture than text… and that was OK.  People need to see what it is you are talking about.

But when it comes to describing a problem, in the vein of "a picture is worth a thousand words", a screen capture can be the devil itself.  The issue with screen captures is simple: they contain information that cannot be easily searched or indexed (apart from with your eyeball).  They may show the problem, or just barely validate that the problem exists, but they rarely help in SOLVING the problem.

Last week I got my favourite kind of screen capture: the one taken of a screen with a phone (with the reflection of the photographer clearly visible in the shot).  Apart from giving me the ability to rate that person's fashion sense, these kinds of shots are among the worst.  Amusingly, when I asked why I didn't also get logs, I was told the customer's security standards would not allow logs to be sent.  Yeah right… this is the same customer who doesn't mind you standing in the middle of their computer room taking photos of their displays with your phone?


So the next time you plan on sending a screen capture, stop for a minute and consider…  is this enough for a perfect solution?   Are there no logs I can send along with this picture?  Has the vendor supplied a tool I can use to offload data?   Or even better automatically send it?    Am I doing anything more than just describing the problem itself?


Shellshock and IBM SVC and Storwize products

While blogging last week about how various vendors have responded to the Shellshock exploit, I noted that several vendors, notably Oracle and Cisco, were open about products for which they did not yet have a fix.  IBM, meanwhile, appears to be announcing a vulnerability only after they have the fix.  In other words, vulnerable customers are left without formal notification that they are exposed, or made aware of any workarounds, until a fix is actually available.  I am left slightly annoyed by this policy.

The formal notification for the Storwize family and IBM SVC family came out here on October 11, 2014.  At the time of writing these are the fix levels:

Remediation/Fixes
IBM recommends that you fix this vulnerability by upgrading affected versions of IBM SAN Volume Controller, IBM Storwize V7000, V5000, V3700 and V3500 to the following code levels or higher:

7.1.0.11
7.2.0.9
7.3.0.7

More importantly it contains this critical piece of information:

Vulnerability Details

The following vulnerabilities are only exploitable by users who already have authenticated access to the system.

In other words, the best way to manage exposure is to limit the number of users who have CLI access and to use network restrictions (such as ACLs and Firewalls) to restrict network access to your devices.

So kudos to IBM for creating fixed versions, I just wish that acknowledgement and remediation advice could have been published earlier.



Vale Randall Davis

I received some very sad news last week that Randall Davis has passed away.

Randall was a very experienced and capable IT professional based in Melbourne Australia. He worked for IBM for many years;  co-authored several IBM Redbooks and fathered two wonderful boys with his wife Fiona.

Randall's funeral will be held in the Federation Chapel, Lilydale Memorial Park, 126-128 Victoria Rd, Lilydale on Wednesday Oct. 8, 2014, commencing at 11.15 am.

If you knew Randall and wish to pay your respects, then please attend.


Shell shocked by binary explanations

On September 24, 2014, some new exploits that allow unauthorized access to Unix-based systems running the bash shell were revealed.  Known collectively as Shellshock, they have caused tremendous consternation and activity in the IT industry.
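
For systems where you do have shell access, the widely circulated one-line test for the original CVE-2014-6271 issue looks like this; it is the generic community test rather than any vendor's official check, and a patched bash prints only the test message rather than the word vulnerable:

env x='() { :;}; echo vulnerable' bash -c "echo this is a test"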

What has proven interesting is the way each major vendor has chosen to respond to this issue. An enormous number of products, whether software, hardware or appliance, are affected.  You could almost safely assume that if a product can be accessed with a Unix-like shell, then it is quite likely going to need patching once the relevant vendor has released a fix.

But how can you know?

The best way is clearly for the vendor in question to release a statement, and this is where things get interesting.  Some vendors have taken the attitude that only once they have a fix will they admit they have the vulnerability.

Ideally each vendor should post a list of:

  • Products that are not vulnerable
  • Products that are vulnerable but a fix is available
  • Products that are vulnerable and no fix is available (yet)
  • Products that may be vulnerable but testing is still in progress

This IBM website here happily lists unaffected products, but gives no guidance as to affected products.  You can see a screen capture below of the start of the unaffected product list.

[Screenshot: the start of IBM's unaffected product list]

The DS8000 has a page here detailing available fixes, but its stablemates the Storwize V7000 (and V3700, V5000 and the SVC) are almost certainly affected too, yet there is not a peep on the internet from IBM about them.  I presume this is because a fix is being written but is not yet available.

Oracle have a great page here which has four sections with titles like:

  • 1.0 Oracle products that are likely vulnerable to CVE-2014-7169 and have fixes currently available
  • 2.0 Oracle products that are likely vulnerable to CVE-2014-7169 but for which no fixes are yet available
  • 3.0 Products That Do Not Include Bash
  • 4.0 Products under investigation for use of Bash

Cisco have a great page here with a very similar set of information with sections like:

  • Affected Products
  • Vulnerable Products
  • Products Confirmed Not Vulnerable

EMC have a page here but, as usual, they make it hard for us common people by putting it behind an authentication wall.


Actifio Copy Data Forum – September 17 in Sydney

It's been a few weeks between posts, for the simple reason that I have been quite busy at Actifio!  One thing that is keeping me busy is an event we are running in Sydney next week, and it would be great to see you there.

You will hear from leading organizations like Westpac NZ, NSW Ambulance Service, and other Actifio customers, and learn how they have transformed their data management with Actifio Copy Data Virtualisation solutions.

At Copy Data Forum 2014, you will learn more about the proven business impact of Actifio, including:

  • Improved Resiliency – through instant data access for data protection and disaster recovery.
  • Enhanced Agility – putting data where you need it and when you need it.
  • Transition to the Cloud – ensuring your data follows your applications wherever they live, including public, private, and hybrid cloud-based systems.
  • Dramatic Savings – up to 90% reduction in storage costs, and up to 70% reduction in network bandwidth.

Sound interesting?     Register Now


Tortured by Tenders? What's the problem?

Many Australian organizations, both government and private enterprise, acquire IT technology through a tender process.

No, not that kind of tender…

[Image: Love Me Tenders]

More like this kind of tender (anyone want a bridge shaped like a coat-hanger?).


The process of creating and responding to a tender actually involves what I call the three tortures:

  • The torture of creating the tender request document.  I have never met a client who enjoys the creation process.  Many resort to paying third parties to create them.
  • The torture of responding to a tender.  I have never met a business partner or vendor who enjoys responding to one!
  • Then the final torture: the torture of reading all those tender responses and selecting the winning one.  I have never personally experienced this torture, but I can imagine how hard it must be reading all those vendor weasel words.


I see five fundamental issues around tenders (well in Australia anyway).

1)  The lawyers are writing most of them (and it’s not helping one bit)


Most tender requests contain a huge amount of legal documentation.  Often less than 10% of all the words in the published documentation relate to technical or (more importantly) business requirements.
Quite seriously, they often include 70-100 pages of legalese and 5-10 pages of truly useful back story as to why this tender has been released at all.
I am certain that every tender response needs to be stated inside a legal framework of responsibilities, but I have not seen any evidence that all of this legalese has prevented failed projects or bad solutions.

2)  Repetition in questions

I cannot overstate how bad this situation is.  I have repeatedly seen tender documents that ask the same questions again and again and again (and again).

Even worse, I see questions that are clearly unfinished, or questions that are missing huge amounts of obvious (and necessary) subtext or back story.  I have seen tenders where the quantity of question-and-answer documents created after the tender was published (as vendor questions were responded to) exceeded the quantity of technical detail provided in the initial documentation.  Quite frankly, that's just astonishing (in a bad way).

It seems that different teams each contribute to the total documentation and the person who compiles and publishes the document has no inkling just how much repetition has occurred in the process.   I don’t blame the authors – I blame the project manager who compiles their contributions and the timelines under which these tender documents are created.   Indeed my gut feel is that management simply don’t give the authors anywhere near enough time or resources to do a good job.

3)  Vendor bias

When you see a tender that asks for SRDF (as opposed to sync or async replication) you know there is a serious (EMC-focused) bias.  Asking for Semi-Sync replication is nearly as bad (that marks it as a NetApp-focused tender).
Many tenders are written with a specific outcome in mind, but all this leads to is weasel words, as all the other responders attempt to use their hammers to batter their products into the shape needed to answer the questions.

The issue is that the tender should really be about business outcomes enabled by IT, not IT solutions that someone thinks will lead to the best IT outcomes (and by implication, maybe, hopefully, the right business outcomes).

The idea of accepting that truly differentiated vendors will help you achieve better outcomes with differentiated technology simply doesn't fit a straight question-and-answer response document.  The Q&A method only suits the accountants trying to score the responses.  But don't worry, all of that is handled next…

4)  No connection between technical requirements and financial reality

I have no issue with every organisation trying to get the best value for their money and the best possible outcomes for every major technical rework.
But if you want 2 petabytes of Tier 1 disk and your budget is $100K, you are not going to get it.
Frankly, most IT departments know full well what their maximum budget is, but if all but one tender response gets knocked back within an hour of submission because they all missed an unstated financial cut-off, you have to question the efficacy of the whole process.  Invariably, at least one vendor with the right contacts knows the 'win price'; everyone else was drifting off in the clouds.

[Image: Throw me a frickin' bone]

5)  My final gripe: nowhere near enough lead time to get the responses written.  I routinely see tenders talked about for months and then released with less than three weeks to create the responses.  This tends to reflect the overall problem with timing… everyone is simply too busy, but the end result is rushed bids.

So do you have a better perspective on what’s going on?  Feel free to share!
